Abstract

Many statistical M-estimators are based on convex optimization problems formed by the
combination of a data-dependent loss function with a norm-based regularizer. We analyze the
convergence rates of projected gradient and composite gradient methods for solving such problems,
working within a high-dimensional framework that allows the data dimension $d$ to grow
with (and possibly exceed) the sample size $n$. This high-dimensional structure precludes the
usual global assumptions, namely strong convexity and smoothness conditions, that underlie
much of classical optimization analysis. We define appropriately restricted versions of these
conditions, and show that they are satisfied with high probability for various statistical models.
Under these conditions, our theory guarantees that projected gradient descent has a globally
geometric rate of convergence up to the statistical precision of the model, meaning the typical
distance between the true unknown parameter $\theta^*$ and an optimal solution $\hat{\theta}$. This result is
substantially sharper than previous convergence results, which yielded sublinear convergence,
or linear convergence only up to the noise level. Our analysis applies to a wide range of
M-estimators and statistical models, including sparse linear regression using Lasso ($\ell_1$-regularized
regression); group Lasso for block sparsity; log-linear models with regularization; low-rank
matrix recovery using nuclear norm regularization; and matrix decomposition. Overall, our
analysis reveals interesting connections between statistical precision and computational efficiency in
high-dimensional estimation.



1 Introduction 



High-dimensional data sets present challenges that are both statistical and computational in nature. 
On the statistical side, recent years have witnessed a flurry of results on consistency and rates

for various estimators under non-asymptotic high-dimensional scaling, meaning that error bounds 
are provided for general settings of the sample size n and problem dimension d, allowing for the 
possibility that $d \gg n$. These results typically involve some assumption regarding the underlying
structure of the parameter space, such as sparse vectors, structured covariance matrices, low-rank 
matrices, or structured regression functions, as well as some regularity conditions on the data- 
generating process. On the computational side, many estimators for statistical recovery are based on 
solving convex programs. Examples of such M-estimators include $\ell_1$-regularized quadratic programs
(also known as the Lasso) for sparse linear regression (e.g., see the papers [HI HSj HSj ETJ (U M, US] 
and references therein), second-order cone programs (SOCP) for the group Lasso (e.g., [36, 25, 20]
and references therein), and semidefinite programming relaxations (SDP) for various problems, 
including sparse PCA and low-rank matrix estimation (e.g., [HI [361 001 [21 [381 [291 [37] and references
therein). 

Many of these programs are instances of convex conic programs, and so can (in principle) be 
solved to $\epsilon$-accuracy in polynomial time using interior point methods, and other standard methods
from convex programming (e.g., see the books [5, 7]). However, the complexity of such quasi-
Newton methods can be prohibitively expensive for the very large-scale problems that arise from 
high-dimensional data sets. Accordingly, recent years have witnessed a renewed interest in simpler 
first-order methods, among them the methods of projected gradient descent and mirror descent. 
Several authors (e.g., I11[2T1[3]) have used variants of Nesterov's accelerated gradient method [32] 
to obtain algorithms for high-dimensional statistical problems with a sublinear rate of convergence. 
Note that an optimization algorithm, generating a sequence of iterates $\{\theta^t\}_{t=0}^{\infty}$, is said to exhibit
sublinear convergence to an optimum $\hat{\theta}$ if the optimization error $\|\theta^t - \hat{\theta}\|$ decays at the rate $1/t^{\kappa}$,
for some exponent $\kappa > 0$ and norm $\|\cdot\|$. Although this type of convergence is quite slow, it is
the best possible with gradient descent-type methods for convex programs under only Lipschitz 
conditions [31]. 

It is known that much faster global rates — in particular, a linear or geometric rate — can be 
achieved if global regularity conditions like strong convexity and smoothness are imposed [31]. An
optimization algorithm is said to exhibit linear or geometric convergence if the optimization error
$\|\theta^t - \hat{\theta}\|$ decays at a rate $\kappa^t$, for some contraction coefficient $\kappa \in (0,1)$. Note that such
convergence is exponentially faster than sublinear convergence. For certain classes of problems involving
polyhedral constraints and global smoothness, Tseng and Luo [26] have established geometric
convergence. However, a challenging aspect of statistical estimation in high dimensions is that the
underlying optimization problems can never be strongly convex in a global sense when $d > n$ (since
the $d \times d$ Hessian matrix is rank-deficient), and global smoothness conditions cannot hold when
$d/n \to +\infty$. Some more recent work has exploited structure specific to the optimization problems
that arise in statistical settings. For the special case of sparse linear regression with random
isotropic designs (also referred to as compressed sensing), some authors have established fast con- 
vergence rates in a local sense, meaning guarantees that apply once the iterates are close enough to 
the optimum [HI |T8] . The intuition underlying these results is that once an algorithm identifies the 
support set of the optimal solution, the problem is then effectively reduced to a lower-dimensional 
subspace, and thus fast convergence can be guaranteed in a local sense. Also in the setting of 
compressed sensing, Tropp and Gilbert [I2j studied finite convergence of greedy algorithms based 
on thresholding techniques, and showed linear convergence up to a certain tolerance. For the same 
class of problems, Garg and Khandekar [17] showed that a thresholded gradient algorithm converges 
rapidly up to some tolerance. In both of these results, the convergence tolerance is of the order of 
the noise variance, and hence substantially larger than the true statistical precision of the problem. 
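
To make the gap between the two rates concrete, the following minimal Python sketch counts how many iterations an idealized sublinear error curve and an idealized geometric error curve need to fall below a target tolerance. The two error models come from the definitions above; the helper name `iterations_until`, the tolerance, and the constants (exponent 1 for the sublinear curve, contraction coefficient 0.9 for the geometric curve) are illustrative assumptions and are not taken from the paper.

```python
def iterations_until(error_at, eps, t_max=10**7):
    """Return the smallest t >= 1 with error_at(t) <= eps (illustrative helper)."""
    for t in range(1, t_max + 1):
        if error_at(t) <= eps:
            return t
    raise RuntimeError("tolerance not reached within t_max iterations")


eps = 1e-3  # target optimization error (assumed value)

# Sublinear model: error decays like 1 / t^kappa with kappa = 1.
t_sublinear = iterations_until(lambda t: 1.0 / t, eps)

# Geometric model: error decays like kappa^t with kappa = 0.9.
t_geometric = iterations_until(lambda t: 0.9 ** t, eps)

print("sublinear:", t_sublinear, "geometric:", t_geometric)
```

Under these assumed constants the geometric curve reaches the tolerance in far fewer iterations, and the gap widens as the tolerance shrinks.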

The focus of this paper is the convergence rate of two simple gradient-based algorithms for solv- 
ing optimization problems that underlie regularized M-estimators. For a constrained problem with 
a differentiable objective function, the projected gradient method generates a sequence of iterates 
$\{\theta^t\}_{t=0}^{\infty}$ by taking a step in the negative gradient direction, and then projecting the result onto the
constraint set. The composite gradient method of Nesterov [32] is well-suited to solving regularized 
problems formed by the sum of a differentiable and (potentially) non-differentiable component. 
The main contribution of this paper is to establish a form of global geometric convergence for 
these algorithms that holds for a broad class of high-dimensional statistical problems. In order to 
provide intuition for this guarantee, Figure 1 shows the performance of projected gradient descent
for a Lasso problem ($\ell_1$-constrained least-squares). In panel (a), we have plotted the logarithm
of the optimization error, measured in terms of the Euclidean norm $\|\theta^t - \hat{\theta}\|$ between the current
iterate $\theta^t$ and an optimal solution $\hat{\theta}$, versus the iteration number $t$. The plot includes three different
curves, corresponding to sparse regression problems in dimension $d \in \{5000, 10000, 20000\}$, and a
fixed sample size n = 2500. Note that all curves are linear (on this logarithmic scale), revealing 
the geometric convergence predicted by our theory. Such convergence is not predicted by classi- 
cal optimization theory, since the objective function cannot be strongly convex whenever n < d. 
Moreover, the convergence is geometric even at early iterations, and takes place to a precision far 
less than the noise level (equal to 0.25 in this example). We also note that the design matrix does not
satisfy the restricted isometry property, as assumed in some past work. 

Figure 1. Convergence rates of projected gradient descent in application to Lasso programs
($\ell_1$-constrained least-squares). Each panel shows the log optimization error $\log \|\theta^t - \hat{\theta}\|$ versus the
iteration number $t$. Panel (a) shows three curves, corresponding to dimensions $d \in \{5000, 10000, 20000\}$,
sparsity $s = \lceil \sqrt{d} \rceil$, and all with the same sample size $n = 2500$. All cases show geometric
convergence, but the rate for larger problems becomes progressively slower. (b) For an appropriately
rescaled sample size ($\alpha = \frac{n}{s \log d}$), all three convergence rates should be roughly the same, as predicted
by the theory.



The results in panel (a) exhibit an interesting property: the convergence rate is dimension- 
dependent, meaning that for a fixed sample size, projected gradient descent converges more slowly 
for a large problem than a smaller problem — compare the squares for d = 20000 to the diamonds for 
d = 5000. This phenomenon reflects the natural intuition that larger problems are, in some sense, 
"harder" than smaller problems. A notable aspect of our theory is that in addition to guaranteeing 
geometric convergence, it makes a quantitative prediction regarding the extent to which a larger 
problem is harder than a smaller one. In particular, our convergence rates suggest that if the 
sample size n is re-scaled in a certain way according to the dimension d and also other model 
parameters such as sparsity, then convergence rates should be roughly similar. Panel (b) provides 
a confirmation of this prediction: when the sample size is rescaled according to our theory (in 
particular, see Corollary 2 in Section 3.2), then all three curves lie essentially on top of one another.

Although high-dimensional optimization problems are typically neither strongly convex nor 
smooth, this paper shows that it is fruitful to consider suitably restricted notions of strong
convexity and smoothness. Our notion of restricted strong convexity (RSC) is related to but slightly
different than that introduced in a recent paper by Negahban et al. [28] for establishing
statistical consistency. As we discuss in the sequel, bounding the optimization error introduces new
challenges not present when analyzing the statistical error. We also introduce a related notion of 
restricted smoothness (RSM), not needed for proving statistical rates but essential in the setting 
of optimization. Our analysis consists of two parts. We first show that for optimization problems 
underlying many regularized M-estimators, appropriately modified notions of restricted strong 
convexity (RSC) and smoothness (RSM) are sufficient to guarantee global linear convergence of 
projected gradient descent. Our second contribution is to prove that for the iterates generated by 
our first-order method, these RSC/RSM assumptions do indeed hold with high probability for a
broad class of statistical models, among them sparse linear models, models with group sparsity 
constraints, and various classes of matrix estimation problems, including matrix completion and 
matrix decomposition. 

An interesting aspect of our results is that the global geometric convergence is not guaranteed 
to an arbitrary numerical precision, but only to an accuracy related to statistical precision of 
the problem. For a given error norm $\|\cdot\|$, given by the Euclidean or Frobenius norm for most
examples in this paper, the statistical precision is given by the mean-squared error $\mathbb{E}[\|\hat{\theta} - \theta^*\|^2]$
between the true parameter $\theta^*$ and the estimate $\hat{\theta}$ obtained by solving the optimization problem,
where the expectation is taken over randomness in the statistical model. Note that this is very
natural from the statistical perspective, since it is the true parameter $\theta^*$ itself (as opposed to the
solution $\hat{\theta}$ of the M-estimator) that is of primary interest, and our analysis allows us to approach
it as close as is statistically possible. Our analysis shows that we can geometrically converge
to a parameter $\theta$ such that $\|\theta - \theta^*\| = \|\hat{\theta} - \theta^*\| + o(\|\hat{\theta} - \theta^*\|)$, which is the best we can hope
for statistically, ignoring lower order terms. Overall, our results reveal an interesting connection 
between the statistical and computational properties of M-estimators — that is, the properties of 
the underlying statistical model that make it favorable for estimation also render it more amenable 
to optimization procedures. 

The remainder of this paper is organized as follows. We begin in Section 2 with a precise
formulation of the class of convex programs analyzed in this paper, along with background on the
notions of a decomposable regularizer, and properties of the loss function. Section 3 is devoted to
the statement of our main convergence result, as well as to the development and discussion of its
various corollaries for specific statistical models. In Section 4, we provide a number of empirical
results that confirm the sharpness of our theoretical predictions. Finally, Section 5 contains the
proofs, with more technical aspects of the arguments deferred to the Appendix.

2 Background and problem formulation 

In this section, we begin by describing the class of regularized M-estimators to which our analysis 
applies, as well as the optimization algorithms that we analyze. Finally, we introduce some im- 
portant notions that underlie our analysis, including the notions of a decomposable regularization, 
and the properties of restricted strong convexity and smoothness. 






2.1 Loss functions, regularization and gradient-based methods 

Given a random variable $Z \sim \mathbb{P}$ taking values in some set $\mathcal{Z}$, let $Z_1^n = \{Z_1, \ldots, Z_n\}$ be a collection
of $n$ observations. Here the integer $n$ is the sample size of the problem. Assuming that $\mathbb{P}$ lies
within some indexed family $\{\mathbb{P}_\theta, \theta \in \Omega\}$, the goal is to recover an estimate of the unknown true
parameter $\theta^* \in \Omega$ generating the data. Here $\Omega$ is some subset of $\mathbb{R}^d$, and the integer $d$ is known as
the ambient dimension of the problem. In order to measure the "fit" of any given parameter $\theta \in \Omega$
to a given data set $Z_1^n$, we introduce a loss function $\mathcal{L}_n : \Omega \times \mathcal{Z}^n \to \mathbb{R}_+$. By construction, for any
given $n$-sample data set $Z_1^n \in \mathcal{Z}^n$, the loss function assigns a cost $\mathcal{L}_n(\theta; Z_1^n) \ge 0$ to the parameter
$\theta \in \Omega$. In many (but not all) applications, the loss function has a separable structure across the
data set, meaning that $\mathcal{L}_n(\theta; Z_1^n) = \frac{1}{n} \sum_{i=1}^n \ell(\theta; Z_i)$ where $\ell : \Omega \times \mathcal{Z} \to \mathbb{R}_+$ is the loss function
associated with a single data point.

Of primary interest in this paper are estimation problems that are under-determined, meaning 
that the number of observations n is smaller than the ambient dimension d. In such settings, 
without further restrictions on the parameter space $\Omega$, there are various impossibility theorems,
asserting that consistent estimates of the unknown parameter $\theta^*$ cannot be obtained. For this
reason, it is necessary to assume that the unknown parameter $\theta^*$ either lies within a smaller subset
of $\Omega$, or is well-approximated by some member of such a subset. In order to incorporate these types
of structural constraints, we introduce a regularizer $\mathcal{R} : \Omega \to \mathbb{R}_+$ over the parameter space. With
these ingredients, the analysis of this paper applies to the constrained M-estimator 

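$\hat{\theta}_\rho \in \arg\min_{\mathcal{R}(\theta) \le \rho} \{\mathcal{L}_n(\theta; Z_1^n)\}$, (1)

where $\rho > 0$ is a user-defined radius, as well as to the regularized M-estimator

$\hat{\theta}_{\lambda_n} \in \arg\min_{\mathcal{R}(\theta) \le \bar{\rho}} \{\mathcal{L}_n(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta)\}$ (2)

where the regularization weight $\lambda_n > 0$ is user-defined. Note that the radii $\rho$ and $\bar{\rho}$ may be different
in general. Throughout this paper, we impose the following two conditions: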

(a) for any data set $Z_1^n$, the function $\mathcal{L}_n(\cdot; Z_1^n)$ is convex and differentiable over $\Omega$, and

(b) the regularizer $\mathcal{R}$ is a norm.

These conditions ensure that the overall problem is convex, so that by Lagrangian duality, the
optimization problems (1) and (2) are equivalent. However, as our analysis will show, solving one
or the other can be computationally preferable depending upon the assumptions made. Some
remarks on notation: when the radius $\rho$ or the regularization parameter $\lambda_n$ is clear from the context,
we will drop the subscript on $\hat{\theta}$ to ease the notation. Similarly, we frequently adopt the shorthand
$\mathcal{L}_n(\theta)$, with the dependence of the loss function on the data being implicitly understood. Procedures
based on optimization problems of either form are known as M-estimators in the statistics literature.

The focus of this paper is on two simple algorithms for solving the above optimization problems.
The method of projected gradient descent applies naturally to the constrained problem (1), whereas
the composite gradient descent method due to Nesterov [32] is suitable for solving the regularized
problem (2). Each routine generates a sequence $\{\theta^t\}_{t=0}^{\infty}$ of iterates by first initializing to some
parameter $\theta^0 \in \Omega$, and then applying the recursive update

$\theta^{t+1} = \arg\min_{\mathcal{R}(\theta) \le \rho} \{\mathcal{L}_n(\theta^t) + \langle \nabla\mathcal{L}_n(\theta^t), \theta - \theta^t \rangle + \frac{\gamma_u}{2} \|\theta - \theta^t\|^2\}$, for $t = 0, 1, 2, \ldots$, (3)

in the case of projected gradient descent, or the update

$\theta^{t+1} = \arg\min_{\mathcal{R}(\theta) \le \bar{\rho}} \{\mathcal{L}_n(\theta^t) + \langle \nabla\mathcal{L}_n(\theta^t), \theta - \theta^t \rangle + \frac{\gamma_u}{2} \|\theta - \theta^t\|^2 + \lambda_n \mathcal{R}(\theta)\}$, for $t = 0, 1, 2, \ldots$, (4)

for the composite gradient method. Note that the only difference between the two updates is the 
addition of the regularization term in the objective. These updates have a natural intuition: the 
next iterate $\theta^{t+1}$ is obtained by constrained minimization of a first-order approximation to the loss
function, combined with a smoothing term that controls how far one moves from the current iterate
in terms of Euclidean norm. Moreover, it is easily seen that the update (3) is equivalent to

$\theta^{t+1} = \Pi\big(\theta^t - \frac{1}{\gamma_u} \nabla\mathcal{L}_n(\theta^t)\big)$, (5)

where $\Pi \equiv \Pi_{\mathbb{B}_{\mathcal{R}}(\rho)}$ denotes Euclidean projection onto the ball $\mathbb{B}_{\mathcal{R}}(\rho) = \{\theta \in \Omega \mid \mathcal{R}(\theta) \le \rho\}$ of
radius $\rho$. In this formulation, we see that the algorithm takes a step in the negative gradient
direction, using the quantity $1/\gamma_u$ as stepsize parameter, and then projects the resulting vector
onto the constraint set. The update (4) takes an analogous form; however, the projection will
depend on both $\lambda_n$ and $\gamma_u$. As will be illustrated in the examples to follow, for many problems,
the updates (3) and (4), or equivalently (5), have a very simple solution. For instance, in the case
of $\ell_1$-regularization, it can be obtained by an appropriate form of the soft-thresholding operator.
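
The projected update (5) can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions rather than the authors' implementation: it uses a least-squares loss, takes the regularizer to be the Euclidean norm so that the projection has a closed form, and sets the stepsize parameter to the largest eigenvalue of the Gram matrix. The function names `project_l2_ball` and `projected_gradient`, the problem sizes, and the radius are all hypothetical.

```python
import numpy as np


def project_l2_ball(v, rho):
    """Euclidean projection onto {theta : ||theta||_2 <= rho}."""
    norm = np.linalg.norm(v)
    return v if norm <= rho else v * (rho / norm)


def projected_gradient(grad, project, theta0, gamma_u, num_iters):
    """Iterate theta <- project(theta - grad(theta) / gamma_u), as in update (5)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        theta = project(theta - grad(theta) / gamma_u)
    return theta


# Synthetic least-squares problem (assumed setup, n > d for simplicity).
rng = np.random.default_rng(0)
n, d, rho = 50, 20, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)


def grad(theta):
    """Gradient of the loss (1 / 2n) * ||y - X theta||_2^2."""
    return X.T @ (X @ theta - y) / n


gamma_u = np.linalg.eigvalsh(X.T @ X / n).max()  # smoothness constant of this loss
theta_hat = projected_gradient(
    grad, lambda v: project_l2_ball(v, rho), np.zeros(d), gamma_u, num_iters=500
)
```

Swapping `project_l2_ball` for a projection onto another norm ball changes the constraint set without touching the iteration itself, which is the sense in which the update separates the gradient step from the projection.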



2.2 Restricted strong convexity and smoothness 

In this section, we define the conditions on the loss function and regularizer that underlie our 
analysis. Global smoothness and strong convexity assumptions play an important role in the 
classical analysis of optimization algorithms O [3 [31] . In application to a differentiable loss function 
$\mathcal{L}_n$, both of these properties are defined in terms of a first-order Taylor series expansion around a
vector $\theta'$ in the direction of $\theta$, namely the quantity

$\mathcal{T}_{\mathcal{L}}(\theta; \theta') := \mathcal{L}_n(\theta) - \mathcal{L}_n(\theta') - \langle \nabla\mathcal{L}_n(\theta'), \theta - \theta' \rangle$. (6)

By the assumed convexity of $\mathcal{L}_n$, this error is always non-negative, and global strong convexity
is equivalent to imposing a stronger condition, namely that for some parameter $\gamma_\ell > 0$, the first-order
Taylor error $\mathcal{T}_{\mathcal{L}}(\theta; \theta')$ is lower bounded by a quadratic term $\frac{\gamma_\ell}{2} \|\theta - \theta'\|^2$ for all $\theta, \theta' \in \Omega$.
Global smoothness is defined in a similar way, by imposing a quadratic upper bound on the Taylor
error. It is known that under global smoothness and strong convexity assumptions, the method of
projected gradient descent (3) enjoys a globally geometric convergence rate, meaning that there is
some $\kappa \in (0,1)$ such that^1

$\|\theta^t - \hat{\theta}\|^2 \lesssim \kappa^t \|\theta^0 - \hat{\theta}\|^2$ for all iterations $t = 0, 1, 2, \ldots$. (7)

^1 In this statement (and throughout the paper), we use $\lesssim$ to mean an inequality that holds with some universal
constant $c$, independent of the problem parameters.

We refer the reader to Bertsekas [5, Prop. 1.2.3, p. 145], or Nesterov [31, Thm. 2.2.8, p. 88] for
such results on projected gradient descent, and to Nesterov [32] for composite gradient descent. 

Unfortunately, in the high-dimensional setting ($d > n$), it is usually impossible to guarantee
strong convexity of the problem (1) in a global sense. For instance, when the data is drawn i.i.d.,
the loss function consists of a sum of $n$ terms. If the loss is twice differentiable, the resulting $d \times d$
Hessian matrix $\nabla^2\mathcal{L}_n(\theta; Z_1^n)$ is often a sum of $n$ matrices each with rank one, so that the Hessian is
rank-degenerate when $n < d$. However, as we show in this paper, in order to obtain fast convergence
rates for the optimization method (3), it is sufficient that (a) the objective is strongly convex
and smooth in a restricted set of directions, and (b) the algorithm approaches the optimum $\hat{\theta}$ only
along these directions. Let us now formalize these ideas.
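
The rank-degeneracy argument can be checked numerically. The following Python sketch, written under the assumption of a least-squares loss with a standard Gaussian design (an illustrative choice), builds the $d \times d$ Hessian as a scaled Gram matrix, counts its strictly positive eigenvalues, and exhibits a direction along which the quadratic form vanishes. The problem sizes and the eigenvalue cutoff are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50  # fewer samples than dimensions
X = rng.standard_normal((n, d))

# Hessian of the least-squares loss (1 / 2n) * ||y - X theta||_2^2:
# a sum of n rank-one matrices x_i x_i^T, scaled by 1 / n.
hessian = X.T @ X / n

# Count eigenvalues that are numerically positive (cutoff is an assumption).
num_positive = int(np.sum(np.linalg.eigvalsh(hessian) > 1e-10))

# The last right-singular vector of X lies in the null space when n < d.
_, _, vt = np.linalg.svd(X)
null_direction = vt[-1]
curvature = float(null_direction @ hessian @ null_direction)

print("positive eigenvalues:", num_positive, "of", d)
print("curvature along null direction:", curvature)
```

Along `null_direction` the loss is flat, so no positive strong convexity constant can hold over all of the parameter space.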

Definition 1 (Restricted strong convexity (RSC)). The loss function $\mathcal{L}_n$ satisfies restricted
strong convexity with respect to $\mathcal{R}$ and with parameters $(\gamma_\ell, \tau_\ell(\mathcal{L}_n))$ over the set $\Omega'$ if

$\mathcal{T}_{\mathcal{L}}(\theta; \theta') \ge \frac{\gamma_\ell}{2} \|\theta - \theta'\|^2 - \tau_\ell(\mathcal{L}_n) \mathcal{R}^2(\theta - \theta')$ for all $\theta, \theta' \in \Omega'$. (8)

We refer to the quantity $\gamma_\ell$ as the (lower) curvature parameter, and to the quantity $\tau_\ell$ as the
tolerance parameter. The set $\Omega'$ corresponds to a suitably chosen subset of the space $\Omega$ of all
possible parameters.

In order to gain intuition for this definition, first suppose that the condition (8) holds with
tolerance parameter $\tau_\ell = 0$. In this case, the regularizer plays no role in the definition, and
condition (8) is equivalent to the usual definition of strong convexity on the optimization set $\Omega$. As
discussed previously, this type of global strong convexity typically fails to hold for high-dimensional
inference problems. In contrast, when the tolerance parameter $\tau_\ell$ is strictly positive, the condition (8)
is much milder, in that it only applies to a limited set of vectors. For a given pair $\theta \ne \theta'$, consider
the inequality

$\frac{\mathcal{R}^2(\theta - \theta')}{\|\theta - \theta'\|^2} < \frac{\gamma_\ell}{2 \tau_\ell(\mathcal{L}_n)}$. (9)

If this inequality is violated, then the right-hand side of the bound (8) is non-positive, in which
case the RSC constraint (8) is vacuous. Thus, restricted strong convexity imposes a non-trivial
constraint only on pairs $\theta \ne \theta'$ for which the inequality (9) holds, and a central part of our analysis
will be to prove that, for the sequence of iterates generated by projected gradient descent, the
optimization error $\Delta^t := \theta^t - \hat{\theta}$ satisfies a constraint of the form (9). We note that since the
regularizer $\mathcal{R}$ is convex, strong convexity of the loss function $\mathcal{L}_n$ also implies the strong convexity
of the regularized loss $\phi_n$.

For the least-squares loss, the RSC definition depends purely on the direction (and not the
magnitude) of the difference vector $\theta - \theta'$. For other types of loss functions, such as those arising
in generalized linear models, it is essential to localize the RSC definition, requiring that it holds
only for pairs for which the norm $\|\theta - \theta'\|_2$ is not too large. We refer the reader to Section 2.4.1
for further discussion of this issue.

Finally, as pointed out by a reviewer, our restricted version of strong convexity can be seen
as an instance of the general theory of paraconvexity (e.g., [33]); however, we are not aware of
convergence rates for minimizing general paraconvex functions.

We also specify an analogous notion of restricted smoothness:

Definition 2 (Restricted smoothness (RSM)). We say the loss function $\mathcal{L}_n$ satisfies restricted
smoothness with respect to $\mathcal{R}$ and with parameters $(\gamma_u, \tau_u(\mathcal{L}_n))$ over the set $\Omega'$ if

$\mathcal{T}_{\mathcal{L}}(\theta; \theta') \le \frac{\gamma_u}{2} \|\theta - \theta'\|^2 + \tau_u(\mathcal{L}_n) \mathcal{R}^2(\theta - \theta')$ for all $\theta, \theta' \in \Omega'$. (10)

As with our definition of restricted strong convexity, the additional tolerance $\tau_u(\mathcal{L}_n)$ is not present
in analogous smoothness conditions in the optimization literature, but it is essential in our set-up.

2.3 Decomposable regularizers 

In past work on the statistical properties of regularization, the notion of a decomposable regularizer 
has been shown to be useful ^J. Although the focus of this paper is a rather different set of 
questions — namely, optimization as opposed to statistics — decomposability also plays an important 
role here. Decomposability is defined with respect to a pair of subspaces of
the parameter space $\Omega \subseteq \mathbb{R}^d$. The set $\mathcal{M}$ is known as the model subspace, whereas the set $\bar{\mathcal{M}}^\perp$,
referred to as the perturbation subspace, captures deviations away from the model subspace.

Definition 3. Given a subspace pair $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$ such that $\mathcal{M} \subseteq \bar{\mathcal{M}}$, we say that a norm $\mathcal{R}$ is
$(\mathcal{M}, \bar{\mathcal{M}}^\perp)$-decomposable if

$\mathcal{R}(\alpha + \beta) = \mathcal{R}(\alpha) + \mathcal{R}(\beta)$ for all $\alpha \in \mathcal{M}$ and $\beta \in \bar{\mathcal{M}}^\perp$. (11)

To gain some intuition for this definition, note that by the triangle inequality, we always have the
bound $\mathcal{R}(\alpha + \beta) \le \mathcal{R}(\alpha) + \mathcal{R}(\beta)$. For a decomposable regularizer, this inequality always holds with
equality. Thus, given a fixed vector $\alpha \in \mathcal{M}$, the key property of any decomposable regularizer is
that it affords the maximum penalization of any deviation $\beta \in \bar{\mathcal{M}}^\perp$.

For a given error norm $\|\cdot\|$, its interaction with the regularizer $\mathcal{R}$ plays an important role in
our results. In particular, we have the following:

Definition 4 (Subspace compatibility). Given the regularizer $\mathcal{R}(\cdot)$ and a norm $\|\cdot\|$, the associated
subspace compatibility is given by

$\Psi(\bar{\mathcal{M}}) := \sup_{\theta \in \bar{\mathcal{M}} \setminus \{0\}} \frac{\mathcal{R}(\theta)}{\|\theta\|}$ when $\bar{\mathcal{M}} \ne \{0\}$, and $\Psi(\{0\}) := 0$. (12)

The quantity $\Psi(\bar{\mathcal{M}})$ corresponds to the Lipschitz constant of the norm $\mathcal{R}$ with respect to $\|\cdot\|$,
when restricted to the subspace $\bar{\mathcal{M}}$.
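
As an illustration of Definition 4, consider the $\ell_1$-norm as regularizer, the Euclidean norm as error norm, and the subspace of vectors supported on an index set of size $s$ (an assumed example, chosen because the answer is known in closed form). By the Cauchy-Schwarz inequality the ratio of the two norms on this subspace is at most $\sqrt{s}$, with equality for a vector whose nonzero entries all have equal magnitude. The Python sketch below checks this numerically; the dimension, the support, and the helper name `l1_over_l2` are hypothetical.

```python
import numpy as np


def l1_over_l2(theta):
    """Ratio R(theta) / ||theta|| for R = l1-norm and the Euclidean error norm."""
    return np.abs(theta).sum() / np.linalg.norm(theta)


rng = np.random.default_rng(0)
d = 10
support = [0, 3, 7]  # assumed support set, so s = 3
s = len(support)

# Ratios for random vectors supported on the chosen coordinates.
ratios = []
for _ in range(1000):
    theta = np.zeros(d)
    theta[support] = rng.standard_normal(s)
    ratios.append(l1_over_l2(theta))

# A vector with equal-magnitude entries on the support attains the supremum.
flat = np.zeros(d)
flat[support] = 1.0
psi = l1_over_l2(flat)

print("largest sampled ratio:", max(ratios), "attained value:", psi)
```

In this example the subspace compatibility equals $\sqrt{s}$, so it grows with the dimension of the model subspace and not with the ambient dimension $d$.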

2.4 Some illustrative examples 

We now describe some particular examples of M-estimators with decomposable regularizers, and 
discuss the form of the projected gradient updates as well as RSC/RSM conditions. We cover 
two main families of examples: log-linear models with sparsity constraints and $\ell_1$-regularization
(Section 2.4.1), and matrix regression problems with nuclear norm regularization (Section 2.4.2).

2.4.1 Sparse log-linear models and $\ell_1$-regularization

Suppose that each sample $Z_i$ consists of a scalar-vector pair $(y_i, x_i) \in \mathbb{R} \times \mathbb{R}^d$, corresponding to
the scalar response $y_i \in \mathcal{Y}$ associated with a vector of predictors $x_i \in \mathbb{R}^d$. A log-linear model
with canonical link function assumes that the response $y_i$ is linked to the covariate vector $x_i$ via
a conditional distribution of the form

$\mathbb{P}(y_i \mid x_i; \theta^*, \sigma) \propto \exp\Big\{ \frac{y_i \langle \theta^*, x_i \rangle - \Phi(\langle \theta^*, x_i \rangle)}{c(\sigma)} \Big\}$,

where $c(\sigma)$ is a known quantity, $\Phi(\cdot)$ is the log-partition function to normalize the density, and $\theta^* \in \mathbb{R}^d$ is an
unknown regression vector. In many applications, the regression vector $\theta^*$ is relatively sparse, so
that it is natural to impose an $\ell_1$-constraint. Computing the maximum likelihood estimate subject
to such a constraint involves solving the convex program^2

$\hat{\theta} \in \arg\min_{\theta \in \Omega} \Big\{ \frac{1}{n} \sum_{i=1}^n \big[ \Phi(\langle \theta, x_i \rangle) - y_i \langle \theta, x_i \rangle \big] \Big\}$ such that $\|\theta\|_1 \le \rho$. (13)

We refer to this estimator as the log-linear Lasso; it is a special case of
the M-estimator (1), with the loss function $\mathcal{L}_n(\theta; Z_1^n) = \frac{1}{n} \sum_{i=1}^n \{\Phi(\langle \theta, x_i \rangle) - y_i \langle \theta, x_i \rangle\}$ and the
regularizer $\mathcal{R}(\theta) = \|\theta\|_1 = \sum_{j=1}^d |\theta_j|$.

Ordinary linear regression is the special case of the log-linear setting with $\Phi(t) = t^2/2$ and
$\Omega = \mathbb{R}^d$, and in this case, the estimator (13) corresponds to the ordinary least-squares version of
the Lasso [13, 41]. Other forms of log-linear Lasso that are of interest include logistic regression,
Poisson regression, and multinomial regression.



Projected gradient updates: Computing the gradient of the log-linear loss from equation (13)
is straightforward: we have $\nabla\mathcal{L}_n(\theta) = \frac{1}{n} \sum_{i=1}^n x_i \{\Phi'(\langle \theta, x_i \rangle) - y_i\}$, and the update (5) corresponds
to the Euclidean projection of the vector $\theta^t - \frac{1}{\gamma_u} \nabla\mathcal{L}_n(\theta^t)$ onto the $\ell_1$-ball of radius $\rho$. It is
well-known that this projection can be characterized in terms of soft-thresholding, and that the
projected update (5) can be computed easily. We refer the reader to Duchi et al. [H] for an efficient
implementation requiring $O(d)$ operations.
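
The characterization of the $\ell_1$-ball projection through soft-thresholding can be sketched as follows. This is not the linear-time method referenced above: for brevity it uses the simpler sort-based variant, which costs $O(d \log d)$ operations. The function name `project_l1_ball` is hypothetical, and the radius is assumed to be strictly positive.

```python
import numpy as np


def project_l1_ball(v, rho):
    """Euclidean projection of v onto {theta : ||theta||_1 <= rho}, with rho > 0.

    Sort-based variant: find the threshold tau such that soft-thresholding
    v at level tau yields a vector with l1-norm exactly rho.
    """
    v = np.asarray(v, dtype=float)
    if np.abs(v).sum() <= rho:
        return v.copy()  # already inside the ball
    u = np.sort(np.abs(v))[::-1]          # magnitudes in decreasing order
    cumulative = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    # Largest k for which the k-th magnitude stays positive after shrinkage.
    k = ks[u - (cumulative - rho) / ks > 0][-1]
    tau = (cumulative[k - 1] - rho) / k
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


projected = project_l1_ball([3.0, 1.0, -2.0], rho=2.0)
print(projected)
```

The projection is itself a soft-thresholding operation at a data-dependent level `tau`, which is why entries of small magnitude are set exactly to zero.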

Composite gradient updates: The composite gradient update for this problem amounts to
solving

$\theta^{t+1} = \arg\min_{\|\theta\|_1 \le \bar{\rho}} \Big\{ \langle \theta, \nabla\mathcal{L}_n(\theta^t) \rangle + \frac{\gamma_u}{2} \|\theta - \theta^t\|_2^2 + \lambda_n \|\theta\|_1 \Big\}$.

The update can be computed by two soft-thresholding operations. The first step is soft-thresholding
the vector $\theta^t - \frac{1}{\gamma_u} \nabla\mathcal{L}_n(\theta^t)$ at a level $\lambda_n/\gamma_u$. If the resulting vector has $\ell_1$-norm greater than $\bar{\rho}$, then
we project onto the $\ell_1$-ball just as before. Overall, the complexity of the update is still $O(d)$ as
before.
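
The two-stage computation just described can be written out as a short Python sketch. It is an illustration under assumptions: the projection step uses a sort-based routine (costing $O(d \log d)$, not the linear-time method cited earlier), the constraint radius is assumed positive, and the names `soft_threshold`, `project_l1_ball`, and `composite_step` are hypothetical.

```python
import numpy as np


def soft_threshold(v, level):
    """Shrink each entry of v toward zero by `level`, zeroing small entries."""
    return np.sign(v) * np.maximum(np.abs(v) - level, 0.0)


def project_l1_ball(v, radius):
    """Sort-based Euclidean projection onto the l1-ball (radius > 0)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    cumulative = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    k = ks[u - (cumulative - radius) / ks > 0][-1]
    return soft_threshold(v, (cumulative[k - 1] - radius) / k)


def composite_step(theta, gradient, gamma_u, lam, rho_bar):
    """One composite gradient update for l1-regularization with an l1 side constraint."""
    theta = np.asarray(theta, dtype=float)
    gradient = np.asarray(gradient, dtype=float)
    # Step 1: gradient step followed by soft-thresholding at level lam / gamma_u.
    z = soft_threshold(theta - gradient / gamma_u, lam / gamma_u)
    # Step 2: project only if the side constraint is violated.
    if np.abs(z).sum() > rho_bar:
        z = project_l1_ball(z, rho_bar)
    return z


theta = np.array([1.0, -2.0, 0.5])
zero_grad = np.zeros(3)
loose = composite_step(theta, zero_grad, gamma_u=1.0, lam=0.75, rho_bar=10.0)
tight = composite_step(theta, zero_grad, gamma_u=1.0, lam=0.75, rho_bar=1.0)
print(loose, tight)
```

When the side constraint is inactive, the update reduces to plain soft-thresholding; when it is active, the projection applies a further shrinkage on top of the first one.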

Decomposability of $\ell_1$-norm: We now illustrate how the $\ell_1$-norm is decomposable with respect
to appropriately chosen subspaces. For any subset $S \subseteq \{1, 2, \ldots, d\}$, consider the subspace

$\mathcal{M}(S) := \{\alpha \in \mathbb{R}^d \mid \alpha_j = 0 \text{ for all } j \notin S\}$, (14)

corresponding to all vectors supported only on $S$. Defining $\bar{\mathcal{M}}(S) = \mathcal{M}(S)$, its orthogonal
complement (with respect to the usual Euclidean inner product) is given by

$\bar{\mathcal{M}}^\perp(S) = \mathcal{M}^\perp(S) = \{\beta \in \mathbb{R}^d \mid \beta_j = 0 \text{ for all } j \in S\}$. (15)

To establish the decomposability of the $\ell_1$-norm with respect to the pair $(\mathcal{M}(S), \bar{\mathcal{M}}^\perp(S))$, note that
any $\alpha \in \mathcal{M}(S)$ can be written in the partitioned form $\alpha = (\alpha_S, 0_{S^c})$, where $\alpha_S \in \mathbb{R}^{|S|}$ and $0_{S^c} \in \mathbb{R}^{d - |S|}$
is a vector of zeros. Similarly, any vector $\beta \in \bar{\mathcal{M}}^\perp(S)$ has the partitioned representation $(0_S, \beta_{S^c})$.
With these representations, we have the decomposition

$\|\alpha + \beta\|_1 = \|(\alpha_S, 0) + (0, \beta_{S^c})\|_1 = \|\alpha\|_1 + \|\beta\|_1$.

Consequently, for any subset $S$, the $\ell_1$-norm is decomposable with respect to the pairs $(\mathcal{M}(S), \bar{\mathcal{M}}^\perp(S))$.

^2 The link function $\Phi$ is convex since it is the log-partition function of a canonical exponential family.

In analogy to the $\ell_1$-norm, various types of group-sparse norms are also decomposable with
respect to non-trivial subspace pairs. We refer the reader to the paper [28] for further discussion
and examples of such decomposable norms. 

RSC/RSM conditions: A calculation using the mean-value theorem shows that for the loss
function (13), the error in the first-order Taylor series, as previously defined in equation (6), can
be written as

$\mathcal{T}_{\mathcal{L}}(\theta; \theta') = \frac{1}{2n} \sum_{i=1}^n \Phi''(\langle \theta_t, x_i \rangle) \big(\langle x_i, \theta - \theta' \rangle\big)^2$,

where $\theta_t = t\theta + (1 - t)\theta'$ for some $t \in [0, 1]$. When $n < d$, we can always find pairs $\theta \ne \theta'$
such that $\langle x_i, \theta - \theta' \rangle = 0$ for all $i = 1, 2, \ldots, n$, showing that the objective function can never be
strongly convex. On the other hand, restricted strong convexity for log-linear models requires only
that there exist positive numbers $(\gamma_\ell, \tau_\ell(\mathcal{L}_n))$ such that

$\frac{1}{2n} \sum_{i=1}^n \Phi''(\langle \theta_t, x_i \rangle) \big(\langle x_i, \theta - \theta' \rangle\big)^2 \ge \frac{\gamma_\ell}{2} \|\theta - \theta'\|^2 - \tau_\ell(\mathcal{L}_n) \mathcal{R}^2(\theta - \theta')$ for all $\theta, \theta' \in \Omega'$, (16)

where $\Omega' := \Omega \cap \mathbb{B}_2(R)$ is the intersection of the parameter space with a Euclidean ball of some
fixed radius $R$ around zero. This restriction is essential because for many generalized linear models,
the Hessian function $\Phi''$ approaches zero as its argument diverges. For instance, for the logistic
function $\Phi(t) = \log(1 + \exp(t))$, we have $\Phi''(t) = \exp(t)/[1 + \exp(t)]^2$, which tends to zero as
$t \to +\infty$. Restricted smoothness imposes an analogous upper bound on the Taylor error. For a
broad class of log-linear models, such bounds hold with tolerances $\tau_\ell(\mathcal{L}_n)$ and $\tau_u(\mathcal{L}_n)$ whose order
is specified in the corollaries that follow our main theorem.
A detailed discussion of RSC for exponential families in statistical problems can be found in the
paper [28].
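
The decay of the logistic curvature, which motivates the restriction to a Euclidean ball, is easy to see numerically. The Python sketch below evaluates the second derivative of the logistic log-partition function at a few arguments. The helper name `logistic_curvature` is hypothetical, and the implementation uses the equivalent product form of the sigmoid and its complement, evaluated at the absolute value of the argument, purely to avoid overflow for large inputs.

```python
import numpy as np


def logistic_curvature(t):
    """Second derivative of Phi(t) = log(1 + exp(t)).

    Equals exp(t) / (1 + exp(t))^2 = s * (1 - s) with s the sigmoid of t.
    The expression is symmetric in t, so it is evaluated at |t| for stability.
    """
    s = 1.0 / (1.0 + np.exp(-np.abs(t)))
    return s * (1.0 - s)


points = np.array([0.0, 2.0, 5.0, 10.0, 20.0])
values = logistic_curvature(points)
for t, value in zip(points, values):
    print(f"t = {t:5.1f}   curvature = {value:.3e}")
```

The curvature is largest at zero and shrinks rapidly as the argument grows, so a lower curvature bound can only hold when the inner products of the parameter with the covariates stay bounded.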

In order to ensure RSC/RSM conditions on the iterates $\theta^t$ of the updates (3) or (4), we also
need to ensure that $\theta^t \in \Omega'$. This can be done by defining $\mathcal{L}'_n = \mathcal{L}_n + \mathbb{I}_{\Omega'}(\theta)$, where $\mathbb{I}_{\Omega'}(\theta)$ is zero
when $\theta \in \Omega'$ and $\infty$ otherwise. This is equivalent to projection on the intersection of the $\ell_1$-ball with $\Omega'$
in the updates (3) and (4) and can be done efficiently with Dykstra's algorithm [15], for instance,
as long as the individual projections are efficient.
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In the special case of linear regression, we have = 1 for all t G M, so that the lower 

bound ()16p involves only the Gram matrix Xjn. (Here X G M"^'^ is the usual design matrix, 
with Xi G M"^ as its i*'* row.) For linear regression and ^i-regularization, the RSC condition is 
equivalent to the lower bound 

\\^^^-^')\^^ >1L\\Q- e'f^ - r,{Cn) \\9 - e'Wl for all 9, 6' G n. (17) 
n 2 

Such a condition corresponds to a variant of the restricted eigenvalue (RE) conditions that have 
been studied in the literature [6l[43]. Such RE conditions are significantly milder than the restricted 
isometry property; we refer the reader to van de Geer and Buhlmann [43j for an in-depth comparison 
of different RE conditions. Prom past work, the condition ()17p is satisfied with high probability for 
a broad classes of anisotropic random design matrices |341 [39] , and parts of our analysis make use 
of this fact. 
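
The projection onto the intersection of the $\ell_1$-ball with $\Omega'$ mentioned above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it assumes that $\Omega'$ is simply the Euclidean ball of radius `R`, it uses a standard sort-based $\ell_1$-ball projection, and the function names and the fixed iteration count are hypothetical choices.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius} (sort-based)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    k = np.nonzero(u * ks > css - radius)[0][-1]  # last index kept after thresholding
    tau = (css[k] - radius) / (k + 1.0)           # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def project_l2_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(v)
    return v.copy() if norm <= radius else v * (radius / norm)

def dykstra_l1_l2(v, rho, R, n_iter=200):
    """Dykstra's alternating projections onto {||x||_1 <= rho} intersected with {||x||_2 <= R}.

    Assumption: a fixed number of sweeps is used instead of a convergence test.
    """
    x = v.copy()
    p = np.zeros_like(v)  # correction term attached to the l1 projection
    q = np.zeros_like(v)  # correction term attached to the l2 projection
    for _ in range(n_iter):
        y = project_l1_ball(x + p, rho)
        p = x + p - y
        x = project_l2_ball(y + q, R)
        q = y + q - x
    return x
```

Each sweep costs one sort plus a few vector operations, so the overall cost stays close to that of the individual projections, in line with the remark above that Dykstra's algorithm is efficient whenever the individual projections are.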



2.4.2 Matrices and nuclear norm regularization 

We now discuss a general class of matrix regression problems that falls within our framework.
Consider the space of $d_1 \times d_2$ matrices endowed with the trace inner product $\langle\langle A, B \rangle\rangle := \operatorname{trace}(A^T B)$.
In order to ease notation, we define $d := \min\{d_1, d_2\}$. Let $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ be an unknown matrix and
suppose that for $i = 1, 2, \ldots, n$, we observe a scalar-matrix pair $Z_i = (y_i, X_i) \in \mathbb{R} \times \mathbb{R}^{d_1 \times d_2}$ linked
to $\Theta^*$ via the linear model

$$y_i = \langle\langle X_i, \Theta^* \rangle\rangle + w_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{18}$$

where $w_i$ is an additive observation noise. In many contexts, it is natural to assume that $\Theta^*$ is
exactly low-rank, or approximately so, meaning that it is well-approximated by a matrix of low
rank. In such settings, a number of authors (e.g., [TBI [38l [29] ) have studied the M-estimator 

$$\hat{\Theta} \in \arg\min_{\Theta \in \mathbb{R}^{d_1 \times d_2}} \Big\{ \frac{1}{2n} \sum_{i=1}^{n} \big(y_i - \langle\langle X_i, \Theta \rangle\rangle\big)^2 \Big\} \quad \text{such that } |||\Theta|||_1 \le \rho, \tag{19}$$

or the corresponding regularized version. Here the nuclear or trace norm is given by $|||\Theta|||_1 := \sum_{j=1}^{d} \sigma_j(\Theta)$,
corresponding to the sum of the singular values. This optimization problem is an instance of a
semidefinite program. As discussed in more detail in Section 3.3, there are various applications in
which this estimator and variants thereof have proven useful.



Form of projected gradient descent: For the M-estimator (19), the projected gradient updates
take a very simple form, namely

$$\Theta^{t+1} = \Pi\Big(\Theta^t - \frac{1}{\gamma_u} \frac{\sum_{i=1}^{n} \big(\langle\langle X_i, \Theta^t \rangle\rangle - y_i\big) X_i}{n}\Big), \tag{20}$$

where $\Pi$ denotes Euclidean projection onto the nuclear norm ball $\mathbb{B}_1(\rho) = \{\Theta \in \mathbb{R}^{d_1 \times d_2} \mid |||\Theta|||_1 \le \rho\}$.

This nuclear norm projection can be obtained by first computing the singular value decomposition
(SVD), and then projecting the vector of singular values onto the $\ell_1$-ball. The latter step can be
achieved by the fast projection algorithms discussed earlier, and there are various methods for fast
computation of SVDs. The composite gradient update also has a simple form, requiring at most
two singular value thresholding operations as was the case for linear regression.

Decomposability of nuclear norm: We now define matrix subspaces for which the nuclear
norm is decomposable. Given a target matrix $\Theta^*$ (that is, a quantity to be estimated), consider its
singular value decomposition $\Theta^* = U D V^T$, where the matrix $D \in \mathbb{R}^{d \times d}$ is diagonal, with the ordered
singular values of $\Theta^*$ along its diagonal, and $d := \min\{d_1, d_2\}$. For an integer $r \in \{1, 2, \ldots, d\}$, let
$U^r \in \mathbb{R}^{d_1 \times r}$ denote the matrix formed by the top $r$ left singular vectors of $\Theta^*$ in its columns, and
we define the matrix $V^r$ in a similar fashion. Using col to denote the column span of a matrix, we
then define the subspaces

$$\mathcal{M}(U^r, V^r) := \big\{\Theta \in \mathbb{R}^{d_1 \times d_2} \mid \operatorname{col}(\Theta^T) \subseteq \operatorname{col}(V^r),\ \operatorname{col}(\Theta) \subseteq \operatorname{col}(U^r)\big\}, \quad \text{and} \tag{21a}$$

$$\bar{\mathcal{M}}^\perp(U^r, V^r) := \big\{\Theta \in \mathbb{R}^{d_1 \times d_2} \mid \operatorname{col}(\Theta^T) \subseteq (\operatorname{col}(V^r))^\perp,\ \operatorname{col}(\Theta) \subseteq (\operatorname{col}(U^r))^\perp\big\}. \tag{21b}$$

Finally, let us verify the decomposability of the nuclear norm. By construction, any pair of matrices
$\Theta \in \mathcal{M}(U^r, V^r)$ and $\Gamma \in \bar{\mathcal{M}}^\perp(U^r, V^r)$ have orthogonal row and column spaces, which implies the
required decomposability condition, namely $|||\Theta + \Gamma|||_1 = |||\Theta|||_1 + |||\Gamma|||_1$.

In some special cases such as matrix completion or matrix decomposition that we describe in the
sequel, $\Omega'$ will involve an additional bound on the entries of $\Theta^*$ as well as the iterates $\Theta^t$ to establish
RSC/RSM conditions. This can be done by augmenting the loss with an indicator of the constraint
and using cyclic projections for computing the updates as mentioned earlier in Example 2.4.1.
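
To make the projected gradient update for nuclear norm constrained matrix regression described in this section concrete, the following is a minimal sketch of one step: a gradient step on the least-squares loss, an SVD, and a projection of the singular values onto the $\ell_1$-ball. It is an illustration under stated assumptions rather than the authors' code: the loss is taken with the $1/(2n)$ scaling, the observation matrices are stacked in a dense array `X` of shape `(n, d1, d2)`, and all function names are hypothetical.

```python
import numpy as np

def project_nonneg_l1(s, radius):
    """Project a nonnegative vector s, sorted in descending order, onto
    {x >= 0, sum(x) <= radius}. For singular values this coincides with
    the l1-ball projection."""
    if s.sum() <= radius:
        return s.copy()
    css = np.cumsum(s)
    ks = np.arange(1, s.size + 1)
    k = np.nonzero(s * ks > css - radius)[0][-1]
    tau = (css[k] - radius) / (k + 1.0)
    return np.maximum(s - tau, 0.0)

def project_nuclear_ball(Theta, radius):
    """Euclidean projection onto {Theta : |||Theta|||_1 <= radius} via one SVD."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)  # s is sorted descending
    return (U * project_nonneg_l1(s, radius)) @ Vt

def projected_gradient_step(Theta, X, y, gamma_u, radius):
    """One projected gradient update for the loss
    (1/(2n)) * sum_i (y_i - <<X_i, Theta>>)^2 (scaling is an assumption)."""
    n = y.size
    residual = np.einsum('ijk,jk->i', X, Theta) - y   # <<X_i, Theta>> - y_i
    grad = np.einsum('i,ijk->jk', residual, X) / n
    return project_nuclear_ball(Theta - grad / gamma_u, radius)
```

The dominant cost per iteration is the SVD; for large matrices a truncated or randomized SVD could be substituted, which is a design choice outside the scope of this sketch.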

3 Main results and some consequences 

We are now equipped to state the two main results of our paper, and discuss some of their
consequences. We illustrate their application to several statistical models, including sparse regression
(Section 3.2), matrix estimation with rank constraints (Section 3.3), and matrix decomposition
problems (Section 3.4).

3.1 Geometric convergence 

Recall that the projected gradient algorithm (3) is well-suited to solving an M-estimation problem
in its constrained form, whereas the composite gradient algorithm (4) is appropriate for a regularized
problem. Accordingly, let $\hat{\theta}$ be any optimal solution to the constrained problem (1), or the
regularized problem (2), and let $\{\theta^t\}_{t=0}^{\infty}$ be a sequence of iterates generated by the
projected gradient updates (3), or the composite gradient updates (4), respectively. Of primary
interest to us in this paper are bounds on the optimization error, which can be measured either in
terms of the error vector $\Delta^t := \theta^t - \hat{\theta}$, or the difference between the cost of $\theta^t$ and the optimal cost
defined by $\hat{\theta}$. In this section, we state two main results, Theorems 1 and 2, corresponding to
the constrained and regularized cases respectively. In addition to the optimization error previously
discussed, both of these results involve the statistical error $\Delta^* := \hat{\theta} - \theta^*$ between the optimum $\hat{\theta}$
and the nominal parameter $\theta^*$. At a high level, these results guarantee that under the RSC/RSM
conditions, the optimization error shrinks geometrically, with a contraction coefficient that depends
on the loss function via the parameters $(\gamma_\ell, \tau_\ell(\mathcal{L}_n))$ and $(\gamma_u, \tau_u(\mathcal{L}_n))$. An interesting feature
is that the contraction occurs only up to a certain tolerance depending on these same parameters
and the statistical error. However, as we discuss, for many statistical problems of interest, we can
show that this tolerance is of a lower order than the intrinsic statistical error, and hence can
be neglected from the statistical point of view. Consequently, our theory gives an explicit upper
bound on the number of iterations required to solve an M-estimation problem up to the statistical
precision.

Footnote (to the subspace definitions (21a) and (21b)): Note that the model space $\mathcal{M}(U^r, V^r)$ is not equal to $\bar{\mathcal{M}}(U^r, V^r)$. Nonetheless, as required by Definition 3, we
do have the inclusion $\mathcal{M}(U^r, V^r) \subseteq \bar{\mathcal{M}}(U^r, V^r)$.



Convergence rates for projected gradient: We now provide the notation necessary for a
precise statement of this claim. Our main result actually involves a family of upper bounds on the
optimization error, one for each pair $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$ of $\mathcal{R}$-decomposable subspaces (see Definition 3). As
will be clarified in the sequel, this subspace choice can be optimized for different models so as to
obtain the tightest possible bounds. For a given pair $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$ such that $16 \Psi^2(\bar{\mathcal{M}}) \tau_u(\mathcal{L}_n) < \gamma_u$,
let us define the contraction coefficient

$$\kappa(\mathcal{L}_n; \bar{\mathcal{M}}) := \Big\{1 - \frac{\gamma_\ell}{\gamma_u} + \frac{16 \Psi^2(\bar{\mathcal{M}}) \big(\tau_u(\mathcal{L}_n) + \tau_\ell(\mathcal{L}_n)\big)}{\gamma_u}\Big\} \Big\{1 - \frac{16 \Psi^2(\bar{\mathcal{M}}) \tau_u(\mathcal{L}_n)}{\gamma_u}\Big\}^{-1}. \tag{22}$$

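In addition, we define the tolerance parameter

$$\epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}}) := \frac{32 \big(\tau_u(\mathcal{L}_n) + \tau_\ell(\mathcal{L}_n)\big) \big(2 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + \Psi(\bar{\mathcal{M}}) \|\Delta^*\| + 2 \mathcal{R}(\Delta^*)\big)^2}{\gamma_u}, \tag{23}$$

where $\Delta^* = \hat{\theta} - \theta^*$ is the statistical error, and $\Pi_{\mathcal{M}^\perp}(\theta^*)$ denotes the Euclidean projection of $\theta^*$
onto the subspace $\mathcal{M}^\perp$.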



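In terms of these two ingredients, we now state our first main result:

Theorem 1. Suppose that the loss function $\mathcal{L}_n$ satisfies the RSC/RSM condition with parameters
$(\gamma_\ell, \tau_\ell(\mathcal{L}_n))$ and $(\gamma_u, \tau_u(\mathcal{L}_n))$ respectively. Let $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$ be any $\mathcal{R}$-decomposable pair of subspaces
such that $\mathcal{M} \subseteq \bar{\mathcal{M}}$ and $0 < \kappa \equiv \kappa(\mathcal{L}_n, \bar{\mathcal{M}}) < 1$. Then for any optimum $\hat{\theta}$ of the problem (1) for
which the constraint is active, we have

$$\|\theta^{t+1} - \hat{\theta}\|^2 \le \kappa^t \|\theta^0 - \hat{\theta}\|^2 + \frac{\epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}})}{1 - \kappa} \quad \text{for all iterations } t = 0, 1, 2, \ldots. \tag{24}$$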

Remarks: Theorem 1 actually provides a family of upper bounds, one for each $\mathcal{R}$-decomposable
pair $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$ such that $0 < \kappa \equiv \kappa(\mathcal{L}_n, \bar{\mathcal{M}}) < 1$. This condition is always satisfied by setting $\bar{\mathcal{M}}$
equal to the trivial subspace $\{0\}$: indeed, by definition (12) of the subspace compatibility, we have
$\Psi(\bar{\mathcal{M}}) = 0$, and hence $\kappa(\mathcal{L}_n; \{0\}) = (1 - \gamma_\ell/\gamma_u) < 1$. Although this choice of $\bar{\mathcal{M}}$ minimizes the
contraction coefficient, it will lead to a very large tolerance parameter $\epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}})$. A more
typical application of Theorem 1 involves non-trivial choices of the subspace $\bar{\mathcal{M}}$.

The bound (24) guarantees that the optimization error decreases geometrically, with contraction
factor $\kappa \in (0, 1)$, up to a certain tolerance proportional to $\epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}})$, as illustrated in
Figure 2(a). The contraction factor $\kappa$ approaches $1 - \gamma_\ell/\gamma_u$ as the number of samples grows.
The appearance of the ratio $\gamma_\ell/\gamma_u$ is natural since it measures the conditioning of the objective
function; more specifically, it is essentially a restricted condition number of the Hessian matrix. On
the other hand, the tolerance parameter $\epsilon$ depends on the choice of decomposable subspaces, the
parameters of the RSC/RSM conditions, and the statistical error $\Delta^* = \hat{\theta} - \theta^*$ (see equation (23)).
In the corollaries of Theorem 1 to follow, we show that the subspaces can often be chosen such that
$\epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}}) = o(\|\hat{\theta} - \theta^*\|^2)$. Consequently, the bound (24) guarantees geometric convergence up
to a tolerance smaller than statistical precision, as illustrated in Figure 2(b). This is sensible, since
in statistical settings, there is no point in optimizing beyond the statistical precision.

Footnote (on the choice of the trivial subspace): Indeed, the setting $\mathcal{M}^\perp = \mathbb{R}^d$ means that the term $\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) = \mathcal{R}(\theta^*)$ appears in the tolerance; this quantity
is far larger than statistical precision.
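
As a small worked illustration of what a geometric rate means in practice, the sketch below computes the smallest iteration count $t$ for which the geometric term $\kappa^t \|\theta^0 - \hat{\theta}\|^2$ in the bound (24) drops below a target level, for instance the tolerance term. The function name and the example values in the usage note are hypothetical; only the inequality $\kappa^t \cdot \text{init} \le \text{target}$ comes from the bound.

```python
import math

def iterations_for_geometric_term(kappa, init_err_sq, target_sq):
    """Smallest integer t >= 0 such that kappa**t * init_err_sq <= target_sq."""
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie in (0, 1)")
    if target_sq <= 0.0:
        raise ValueError("target_sq must be positive")
    if init_err_sq <= target_sq:
        return 0
    # Closed form: t >= log(init / target) / log(1 / kappa)
    t = math.ceil(math.log(init_err_sq / target_sq) / math.log(1.0 / kappa))
    # Guard against floating-point rounding in the logarithms, in both directions
    while kappa ** t * init_err_sq > target_sq:
        t += 1
    while t > 0 and kappa ** (t - 1) * init_err_sq <= target_sq:
        t -= 1
    return t
```

For example, with $\kappa = 0.5$, an initial squared error of 1 and a target of $2^{-10}$, the count is 10: the number of iterations grows only logarithmically in the ratio of initial error to target.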




Figure 2. (a) Generic illustration of Theorem 1. The optimization error $\Delta^t = \theta^t - \hat{\theta}$ is guaranteed to
decrease geometrically with coefficient $\kappa \in (0, 1)$, up to the tolerance $\epsilon^2 = \epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}})$, represented
by the circle. (b) Relation between the optimization tolerance $\epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}})$ (solid circle) and the
statistical precision $\|\Delta^*\| = \|\theta^* - \hat{\theta}\|$ (dotted circle). In many settings, we have $\epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}}) \ll \|\Delta^*\|^2$,
so that convergence is guaranteed up to a tolerance lower than statistical precision.



The result of Theorem 1 takes a simpler form when there is a subspace $\mathcal{M}$ that includes $\theta^*$, and
the $\mathcal{R}$-ball radius is chosen such that $\rho \le \mathcal{R}(\theta^*)$. In this case, by appropriately controlling the error
term, we can establish that it is of lower order than the statistical precision, namely the squared
difference $\|\hat{\theta} - \theta^*\|^2$ between an optimal solution $\hat{\theta}$ to the convex program (1) and the unknown
parameter $\theta^*$.

Corollary 1. In addition to the conditions of Theorem 1, suppose that $\theta^* \in \mathcal{M}$ and $\rho \le \mathcal{R}(\theta^*)$.
Then as long as $\Psi^2(\bar{\mathcal{M}}) \big(\tau_u(\mathcal{L}_n) + \tau_\ell(\mathcal{L}_n)\big) = o(1)$, we have

$$\|\theta^{t+1} - \hat{\theta}\|^2 \le \kappa^t \|\theta^0 - \hat{\theta}\|^2 + o\big(\|\hat{\theta} - \theta^*\|^2\big) \quad \text{for all iterations } t = 0, 1, 2, \ldots. \tag{25}$$

Thus, Corollary 1 guarantees that the optimization error decreases geometrically, with contraction
factor $\kappa$, up to a tolerance that is of strictly lower order than the statistical precision $\|\hat{\theta} - \theta^*\|^2$.
As will be clarified in several examples to follow, the condition $\Psi^2(\bar{\mathcal{M}}) \big(\tau_u(\mathcal{L}_n) + \tau_\ell(\mathcal{L}_n)\big) = o(1)$
is satisfied for many statistical models, including sparse linear regression and low-rank matrix
regression. This result is illustrated in Figure 2(b), where the solid circle represents the optimization
tolerance, and the dotted circle represents the statistical precision. In the results to follow, we will
quantify the term $o(\|\hat{\theta} - \theta^*\|^2)$ in a more precise manner for different statistical models.

Convergence rates for composite gradient: We now present our main result for the composite
gradient iterates (4) that are suitable for the Lagrangian-based estimator (2). As before, our analysis
yields a range of bounds indexed by subspace pairs $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$ that are $\mathcal{R}$-decomposable. For any
subspace $\bar{\mathcal{M}}$ such that $64 \tau_\ell(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}}) < \gamma_\ell$, we define the effective RSC coefficient as

$$\bar{\gamma}_\ell := \gamma_\ell - 64 \tau_\ell(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}}). \tag{26}$$

This coefficient accounts for the residual amount of strong convexity after accounting for the lower
tolerance terms. In addition, we define the compound contraction coefficient as

$$\kappa(\mathcal{L}_n; \bar{\mathcal{M}}) := \Big\{1 - \frac{\bar{\gamma}_\ell}{4 \gamma_u} + \frac{64 \Psi^2(\bar{\mathcal{M}}) \tau_u(\mathcal{L}_n)}{\bar{\gamma}_\ell}\Big\}\, \xi(\bar{\mathcal{M}}), \tag{27}$$

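where $\xi(\bar{\mathcal{M}}) := \Big(1 - \frac{64 \tau_u(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}})}{\bar{\gamma}_\ell}\Big)^{-1}$, and $\Delta^* = \hat{\theta}_{\lambda_n} - \theta^*$ is the statistical error vector for a
specific choice of $\bar{\rho}$ and $\lambda_n$. As before, the coefficient $\kappa$ measures the geometric rate of convergence
for the algorithm. Finally, we define the compound tolerance parameter

$$\epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}}) := 8\, \xi(\bar{\mathcal{M}})\, \beta(\bar{\mathcal{M}}) \big(6 \Psi(\bar{\mathcal{M}}) \|\Delta^*\| + 8 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*))\big)^2, \tag{28}$$

where $\beta(\bar{\mathcal{M}}) := 2 \Big(\frac{\bar{\gamma}_\ell}{4 \gamma_u} + \frac{128 \tau_u(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}})}{\bar{\gamma}_\ell}\Big) \tau_\ell(\mathcal{L}_n) + 8 \tau_u(\mathcal{L}_n) + 2 \tau_\ell(\mathcal{L}_n)$. As with our previous result,
the tolerance parameter determines the radius up to which geometric convergence can be attained.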

Recall that the regularized problem (2) involves both a regularization weight $\lambda_n$ and a constraint
radius $\bar{\rho}$. Our theory requires that the constraint radius is chosen such that $\bar{\rho} \ge \mathcal{R}(\theta^*)$, which
ensures that $\theta^*$ is feasible. In addition, the regularization parameter should be chosen to satisfy
the constraint

$$\lambda_n \ge 2 \mathcal{R}^*\big(\nabla \mathcal{L}_n(\theta^*)\big), \tag{29}$$

where $\mathcal{R}^*$ is the dual norm of the regularizer. This constraint is known to play an important
role in proving bounds on the statistical error of regularized M-estimators (see the paper [28] and
references therein for further details). Recalling the definition (2) of the overall objective function
$\phi_n(\theta)$, the following result provides bounds on the excess loss $\phi_n(\theta^t) - \phi_n(\hat{\theta}_{\lambda_n})$.

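Theorem 2. Consider the optimization problem (2) for a radius $\bar{\rho}$ such that $\theta^*$ is feasible, and
a regularization parameter $\lambda_n$ satisfying the bound (29), and suppose that the loss function $\mathcal{L}_n$
satisfies the RSC/RSM condition with parameters $(\gamma_\ell, \tau_\ell(\mathcal{L}_n))$ and $(\gamma_u, \tau_u(\mathcal{L}_n))$ respectively. Let
$(\mathcal{M}, \bar{\mathcal{M}}^\perp)$ be any $\mathcal{R}$-decomposable pair such that

$$\kappa \equiv \kappa(\mathcal{L}_n, \bar{\mathcal{M}}) \in [0, 1), \quad \text{and} \quad \frac{32 \bar{\rho}}{1 - \kappa(\mathcal{L}_n; \bar{\mathcal{M}})}\, \xi(\bar{\mathcal{M}})\, \beta(\bar{\mathcal{M}}) \le \lambda_n. \tag{30}$$

Then for any tolerance parameter $\delta^2 \ge \frac{\epsilon^2(\Delta^*; \mathcal{M}, \bar{\mathcal{M}})}{1 - \kappa}$, we have

$$\phi_n(\theta^t) - \phi_n(\hat{\theta}_{\lambda_n}) \le \delta^2 \quad \text{for all} \quad t \ge \frac{2 \log \frac{\phi_n(\theta^0) - \phi_n(\hat{\theta}_{\lambda_n})}{\delta^2}}{\log(1/\kappa)} + \log_2 \log_2 \Big(\frac{\bar{\rho} \lambda_n}{\delta^2}\Big) \Big(1 + \frac{\log 2}{\log(1/\kappa)}\Big). \tag{31}$$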



Footnote (to the statistical error vector $\Delta^*$ defined above): When the context is clear, we remind the reader that we drop the subscript $\lambda_n$ on the parameter $\hat{\theta}$.

Remarks: Note that the bound (31) guarantees that the excess loss $\phi_n(\theta^t) - \phi_n(\hat{\theta})$ decays
geometrically up to any squared error $\delta^2$ larger than the compound tolerance (28). Moreover, the RSC
condition also allows us to translate this bound on objective values to a bound on the optimization
error $\theta^t - \hat{\theta}$. In particular, for any iterate $\theta^t$ such that $\phi_n(\theta^t) - \phi_n(\hat{\theta}) \le \delta^2$, we are guaranteed that

\W -^\n\\ <— H =79 \ = • l-^^J 

In conjunction with Theorem 2, we see that it suffices to take a number of steps that is logarithmic
in the inverse tolerance $(1/\delta)$, again showing a geometric rate of convergence.

Whereas Theorem 1 requires setting the radius so that the constraint is active, Theorem 2 has
only a very mild constraint on the radius $\bar{\rho}$, namely that it be large enough such that $\bar{\rho} \ge \mathcal{R}(\theta^*)$.
The reason for this much milder requirement is that the additive regularization with weight $\lambda_n$
suffices to constrain the solution, whereas the extra side constraint is only needed to ensure good
behavior of the optimization algorithm in the first few iterations. The regularization parameter
$\lambda_n$ must satisfy the so-called dual norm condition (29), which has appeared in past literature on
statistical estimation, and is well-characterized for a broad range of statistical models (e.g., see the
paper [28] and references therein).



Step-size setting: It seems that the updates (3) and (4) need to know the smoothness bound
$\gamma_u$ in order to set the step-size for gradient updates. However, we can use the same doubling trick
as described in Algorithm (3.1) of Nesterov [32]. At each step, we check if the smoothness upper
bound holds at the current iterate relative to the previous one. If the condition does not hold, we
double our estimate of $\gamma_u$ and resume. This guarantees geometric convergence with a contraction
factor that is worse by at most a factor of 2, compared to the case in which $\gamma_u$ is known. We refer the reader to
Nesterov [32] for details.
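
The doubling trick above can be sketched as follows. This is a simplified illustration in the spirit of the description, not a transcription of Nesterov's Algorithm (3.1): the check used here is the standard quadratic upper bound with curvature `gamma`, and the function name, the `project` argument (identity by default) and the small numerical slack are assumptions made for the example.

```python
import numpy as np

def gradient_step_with_doubling(loss, grad, theta, gamma,
                                project=lambda v: v, max_doublings=60):
    """One (projected) gradient step that doubles the smoothness estimate
    `gamma` until the quadratic upper bound holds at the new iterate."""
    g = grad(theta)
    f0 = loss(theta)
    for _ in range(max_doublings):
        candidate = project(theta - g / gamma)
        step = candidate - theta
        # Upper bound implied by smoothness with parameter gamma
        upper = f0 + g @ step + 0.5 * gamma * (step @ step)
        if loss(candidate) <= upper + 1e-12:   # small slack for rounding
            return candidate, gamma
        gamma *= 2.0                           # estimate too small: double it
    raise RuntimeError("smoothness estimate did not stabilise")
```

Because the check always passes once `gamma` reaches the true smoothness constant, the accepted estimate never exceeds twice that constant (unless the initial guess was already larger), which is the source of the factor of 2 mentioned above.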

The following subsections are devoted to the development of some consequences of Theorems 1
and 2 and Corollary 1 for some specific statistical models, among them sparse linear regression
with $\ell_1$-regularization, and matrix regression with nuclear norm regularization. In contrast to
the entirely deterministic arguments that underlie Theorems 1 and 2, these corollaries involve
probabilistic arguments, more specifically in order to establish that the RSC and RSM properties
hold with high probability.

3.2 Sparse vector regression 

Recall from Section 2.4.1 the observation model for sparse linear regression. In a variety of
applications, it is natural to assume that $\theta^*$ is sparse. For a parameter $q \in [0, 1]$ and radius $R_q > 0$, let
us define the $\ell_q$ "ball"

$$\mathbb{B}_q(R_q) := \Big\{\theta \in \mathbb{R}^d \;\Big|\; \sum_{j=1}^{d} |\theta_j|^q \le R_q\Big\}. \tag{33}$$

Note that $q = 0$ corresponds to the case of "hard sparsity", for which any vector $\theta \in \mathbb{B}_q(R_q)$ is
supported on a set of cardinality at most $R_q$. For $q \in (0, 1]$, membership in the set $\mathbb{B}_q(R_q)$ enforces
a decay rate on the ordered coefficients, thereby modelling approximate sparsity. In order to
estimate the unknown regression vector $\theta^* \in \mathbb{B}_q(R_q)$, we consider the least-squares Lasso estimator
from Section 2.4.1, based on the quadratic loss function $\mathcal{L}(\theta; Z_1^n) := \frac{1}{2n} \|y - X\theta\|_2^2$, where $X \in \mathbb{R}^{n \times d}$
is the design matrix. In order to state a concrete result, we consider a random design matrix
$X$, in which each row $x_i \in \mathbb{R}^d$ is drawn i.i.d. from a $N(0, \Sigma)$ distribution, where $\Sigma$ is a positive
definite covariance matrix. We refer to this as the $\Sigma$-ensemble of random design matrices, and
use $\sigma_{\max}(\Sigma)$ and $\sigma_{\min}(\Sigma)$ to refer to the maximum and minimum eigenvalues of $\Sigma$ respectively, and
$\zeta(\Sigma) := \max_{j = 1, 2, \ldots, d} \Sigma_{jj}$ for the maximum variance. We also assume that the observation noise is
zero-mean and sub-Gaussian with parameter $\nu^2$.

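Guarantees for constrained Lasso: Our convergence rate on the optimization error $\theta^t - \hat{\theta}$ is
stated in terms of the contraction coefficient

$$\kappa := \Big\{1 - \frac{\sigma_{\min}(\Sigma)}{4 \sigma_{\max}(\Sigma)} + \chi_n(\Sigma)\Big\} \Big\{1 - \chi_n(\Sigma)\Big\}^{-1}, \tag{34}$$

where we have adopted the shorthand

$$\chi_n(\Sigma) := \begin{cases} \dfrac{c_0\, \zeta(\Sigma)}{\sigma_{\max}(\Sigma)} \Big(\dfrac{s \log d}{n}\Big) & \text{for } q = 0, \\[2ex] \dfrac{c_0\, \zeta(\Sigma)}{\sigma_{\max}(\Sigma)}\, R_q \Big(\dfrac{\log d}{n}\Big)^{1 - q/2} & \text{for } q \in (0, 1], \end{cases} \quad \text{for a numerical constant } c_0. \tag{35}$$

We assume that $\chi_n(\Sigma)$ is small enough to ensure that $\kappa \in (0, 1)$; in terms of the sample size, this
amounts to a condition of the form $n = \Omega\big(R_q^{1/(1 - q/2)} \log d\big)$. Such a scaling is sensible, since it is
known from minimax theory on sparse linear regression [35] to be necessary for any method to be
statistically consistent over the $\ell_q$-ball.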

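With this set-up, we have the following consequence of Theorem 1:

Corollary 2 (Sparse vector recovery). Under conditions of Theorem 1, suppose that we solve the
constrained Lasso with $\rho \le \|\theta^*\|_1$.

(a) Exact sparsity: If $\theta^*$ is supported on a subset of cardinality $s$, then with probability at least
$1 - \exp(-c_1 \log d)$, the iterates (3) with $\gamma_u = 2 \sigma_{\max}(\Sigma)$ satisfy

$$\|\theta^t - \hat{\theta}\|_2^2 \le \kappa^t \|\theta^0 - \hat{\theta}\|_2^2 + c_2\, \chi_n(\Sigma)\, \|\hat{\theta} - \theta^*\|_2^2 \quad \text{for all } t = 0, 1, 2, \ldots. \tag{36}$$

(b) Weak sparsity: Suppose that $\theta^* \in \mathbb{B}_q(R_q)$ for some $q \in (0, 1]$. Then with probability at least
$1 - \exp(-c_1 \log d)$, the iterates (3) with $\gamma_u = 2 \sigma_{\max}(\Sigma)$ satisfy

$$\|\theta^t - \hat{\theta}\|_2^2 \le \kappa^t \|\theta^0 - \hat{\theta}\|_2^2 + c_2\, \chi_n(\Sigma) \Big\{ R_q \Big(\frac{\log d}{n}\Big)^{1 - q/2} + \|\hat{\theta} - \theta^*\|_2^2 \Big\}. \tag{37}$$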

We provide the proof of Corollary 2 in Section 5.4. Here we compare part (a), which deals
with the special case of exactly sparse vectors, to some past work that has established convergence
guarantees for optimization algorithms for sparse linear regression. Certain methods are known to
converge at sublinear rates (e.g., [4]), more specifically at the rate $O(1/t^2)$. The geometric rate of
convergence guaranteed by Corollary 2 is exponentially faster. Other work on sparse regression has
provided geometric rates of convergence that hold once the iterates are close to the optimum [HI 
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[18], or geometric convergence up to the noise level using various methods, including greedy 
methods [l2] and thresholded gradient methods p!7]. In contrast. Corollary [2] guarantees geometric 
convergence for all iterates up to a precision below that of statistical error. For these problems,
the statistical error $\frac{s \log d}{n}$ is typically much smaller than the noise variance $\nu^2$, and decreases as
the sample size is increased.

In addition, Corollary 2 also applies to the case of approximately sparse vectors, lying within
the set $\mathbb{B}_q(R_q)$ for $q \in (0, 1]$. There are some important differences between the case of exact
sparsity (Corollary 2(a)) and that of approximate sparsity (Corollary 2(b)). Part (a) guarantees
geometric convergence to a tolerance depending only on the statistical error $\|\hat{\theta} - \theta^*\|_2$. In contrast,
the second result also has the additional term $R_q \big(\frac{\log d}{n}\big)^{1 - q/2}$. This second term arises due to the
statistical non-identifiability of linear regression over the $\ell_q$-ball, and it is no larger than $\|\hat{\theta} - \theta^*\|_2^2$
with high probability. This assertion follows from known results [35] about minimax rates for linear
regression over $\ell_q$-balls; these unimprovable rates include a term of this order.
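
The constrained Lasso iteration analysed in Corollary 2 can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not the authors' experimental code: it uses the global smoothness constant of the quadratic loss as the step-size parameter (a conservative stand-in for the choice $\gamma_u = 2\sigma_{\max}(\Sigma)$ in the corollary), a standard sort-based $\ell_1$-ball projection, and hypothetical function names.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius} (sort-based)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    k = np.nonzero(u * ks > css - radius)[0][-1]
    tau = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def constrained_lasso_pgd(X, y, rho, n_iter=100):
    """Projected gradient for min (1/(2n)) ||y - X theta||_2^2 s.t. ||theta||_1 <= rho.

    Returns the final iterate and the loss value at every iterate.
    """
    n, d = X.shape
    # Assumption: step size 1/gamma with gamma the largest eigenvalue of X^T X / n.
    gamma = np.linalg.eigvalsh(X.T @ X / n)[-1]
    theta = np.zeros(d)
    losses = []
    for _ in range(n_iter):
        residual = X @ theta - y
        losses.append(residual @ residual / (2 * n))
        theta = project_l1_ball(theta - (X.T @ residual) / (n * gamma), rho)
    residual = X @ theta - y
    losses.append(residual @ residual / (2 * n))
    return theta, np.array(losses)
```

Plotting the logarithm of the distance between successive iterates and the final iterate against the iteration number is one simple way to look for the linear decay followed by a plateau that the corollary describes.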

Guarantees for regularized Lasso: Using similar methods, we can also use Theorem 2 to
obtain an analogous guarantee for the regularized Lasso estimator. Here we focus only on the case
of exact sparsity, although the result extends to approximate sparsity in a similar fashion. Letting
$c_i$, $i = 0, 1, 2, 3, 4$ be universal positive constants, we define the modified curvature constant
$\bar{\gamma}_\ell := \gamma_\ell - c_0 \frac{s \log d}{n} \zeta(\Sigma)$. Our results assume that $n = \Omega(s \log d)$, a condition known to be necessary
for statistical consistency, so that $\bar{\gamma}_\ell > 0$. The contraction factor then takes the form

$$\kappa := \Big\{1 - \frac{\bar{\gamma}_\ell}{16 \sigma_{\max}(\Sigma)} + c_1 \chi_n(\Sigma)\Big\} \Big\{1 - c_2 \chi_n(\Sigma)\Big\}^{-1}, \quad \text{where } \chi_n(\Sigma) = \frac{\zeta(\Sigma)}{\bar{\gamma}_\ell} \frac{s \log d}{n}.$$

The tolerance factor in the optimization is given by

$$\epsilon^2_{\mathrm{tol}} := \frac{5 + c_2 \chi_n(\Sigma)}{1 - c_3 \chi_n(\Sigma)}\, \frac{\zeta(\Sigma)\, s \log d}{n}\, \|\theta^* - \hat{\theta}\|_2^2, \tag{38}$$

where $\theta^* \in \mathbb{R}^d$ is the unknown regression vector, and $\hat{\theta}$ is any optimal solution. With this notation,
we have the following corollary. 

Corollary 3 (Regularized Lasso). Under conditions of Theorem 2, suppose that we solve the
regularized Lasso with $\lambda_n = 6 \nu \sqrt{\frac{\log d}{n}}$, and that $\theta^*$ is supported on a subset of cardinality at most $s$.
Suppose that we have the condition

1 , ^ ^ _L 64s log d/n 

, I < A„. (39) 

n It 128s log d/n - ^ ' 

4711 -yi 

Then with probability at least $1 - \exp(-c_4 \log d)$, for any $\delta^2 \ge \epsilon^2_{\mathrm{tol}}$ and any optimum $\hat{\theta}_{\lambda_n}$, we have

\\9^-9xJl<S^ for all iterations t > (log V ( ^) • 

As with Corollary 2(a), this result guarantees that $O(\log(1/\epsilon^2_{\mathrm{tol}}))$ iterations are sufficient to obtain
an iterate $\theta^t$ that is within squared error $O(\epsilon^2_{\mathrm{tol}})$ of any optimum $\hat{\theta}_{\lambda_n}$. The condition (39) is the
specialization of Equation (30) to the sparse linear regression problem, and imposes an upper bound
on admissible settings of $\bar{\rho}$ for our theory. Moreover, whenever $\frac{s \log d}{n} = o(1)$, a condition that is
required for statistical consistency of any method, the optimization tolerance $\epsilon^2_{\mathrm{tol}}$ is of lower order
than the statistical error $\|\theta^* - \hat{\theta}\|_2^2$.

3.3 Matrix regression with rank constraints

We now turn to estimation of matrices under various types of "soft" rank constraints. Recall
the model of matrix regression from Section 2.4.2, and the M-estimator based on least-squares
regularized with the nuclear norm (19). So as to reduce notational overhead, here we specialize to
square matrices $\Theta^* \in \mathbb{R}^{d \times d}$, so that our observations are of the form

$$y_i = \langle\langle X_i, \Theta^* \rangle\rangle + w_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{40}$$

where $X_i \in \mathbb{R}^{d \times d}$ is a matrix of covariates, and $w_i \sim N(0, \nu^2)$ is Gaussian noise. As discussed
in Section 2.4.2, the nuclear norm $\mathcal{R}(\Theta) = |||\Theta|||_1 = \sum_{j=1}^{d} \sigma_j(\Theta)$ is decomposable with respect to
appropriately chosen matrix subspaces, and we exploit this fact heavily in our analysis.

We model the behavior of both exactly and approximately low-rank matrices by enforcing
a sparsity condition on the vector $\sigma(\Theta) = [\sigma_1(\Theta)\ \sigma_2(\Theta)\ \cdots\ \sigma_d(\Theta)]$ of singular values. In
particular, for a parameter $q \in [0, 1]$, we define the $\ell_q$-"ball" of matrices

$$\mathbb{B}_q(R_q) := \Big\{\Theta \in \mathbb{R}^{d \times d} \;\Big|\; \sum_{i=1}^{d} |\sigma_i(\Theta)|^q \le R_q\Big\}. \tag{41}$$

Note that if $q = 0$, then $\mathbb{B}_q(R_q)$ consists of the set of all matrices with rank at most $r = R_q$. On
the other hand, for $q \in (0, 1]$, the set $\mathbb{B}_q(R_q)$ contains matrices of all ranks, but enforces a relatively
fast rate of decay on the singular values.

3.3.1 Bounds for matrix compressed sensing 

We begin by considering the compressed sensing version of matrix regression, a model first intro- 
duced by Recht et al. [37], and later studied by other authors (e.g., [24l ES]). In this model, the 
observation matrices $X_i \in \mathbb{R}^{d \times d}$ are dense and drawn from some random ensemble. The simplest
example is the standard Gaussian ensemble, in which each entry of $X_i$ is drawn i.i.d. as standard
normal $N(0, 1)$. Note that $X_i$ is a dense matrix in general; this is an important contrast with the
matrix completion setting to follow shortly.

Here we consider a more general ensemble of random matrices $X_i$, in which each matrix
$X_i \in \mathbb{R}^{d \times d}$ is drawn i.i.d. from a zero-mean normal distribution in $\mathbb{R}^{d^2}$ with covariance matrix
$\Sigma \in \mathbb{R}^{d^2 \times d^2}$. The setting $\Sigma = I_{d^2 \times d^2}$ recovers the standard Gaussian ensemble studied in past work.
As usual, we let $\sigma_{\max}(\Sigma)$ and $\sigma_{\min}(\Sigma)$ denote the maximum and minimum eigenvalues of $\Sigma$, and we
define $\zeta_{\mathrm{mat}}(\Sigma) = \sup_{\|u\|_2 = 1} \sup_{\|v\|_2 = 1} \operatorname{var}\big(\langle\langle X, u v^T \rangle\rangle\big)$, corresponding to the maximal variance of $X$
when projected onto rank one matrices. For the identity ensemble, we have $\zeta_{\mathrm{mat}}(I) = 1$.

We now state a result on the convergence of the updates (20) when applied to a statistical
problem involving a matrix $\Theta^* \in \mathbb{B}_q(R_q)$. The convergence rate depends on the contraction coefficient

where $\chi_n(\Sigma) := \frac{c_1\, \zeta_{\mathrm{mat}}(\Sigma)}{\sigma_{\max}(\Sigma)} R_q \big(\frac{d}{n}\big)^{1 - q/2}$ for some universal constant $c_1$. In the case $q = 0$, corresponding
to matrices with rank at most $r$, note that we have $R_0 = r$. With this notation, we have the
following convergence guarantee:

Corollary 4 (Low-rank matrix recovery). Under conditions of Theorem 1, consider the semidefinite
program (19) with $\rho \le |||\Theta^*|||_1$, and suppose that we apply the projected gradient updates (20) with
$\gamma_u = 2 \sigma_{\max}(\Sigma)$.

(a) Exactly low-rank: In the case $q = 0$, if $\Theta^*$ has rank $r < d$, then with probability at least
$1 - \exp(-c_0 d)$, the iterates (20) satisfy the bound

$$|||\Theta^t - \hat{\Theta}|||_F^2 \le \kappa^t\, |||\Theta^0 - \hat{\Theta}|||_F^2 + c_2\, \chi_n(\Sigma)\, |||\hat{\Theta} - \Theta^*|||_F^2 \quad \text{for all } t = 0, 1, 2, \ldots. \tag{42}$$

(b) Approximately low-rank: If $\Theta^* \in \mathbb{B}_q(R_q)$ for some $q \in (0, 1]$, then with probability at least
$1 - \exp(-c_0 d)$, the iterates (20) satisfy

$$|||\Theta^t - \hat{\Theta}|||_F^2 \le \kappa^t\, |||\Theta^0 - \hat{\Theta}|||_F^2 + c_2\, \chi_n(\Sigma) \Big\{ R_q \Big(\frac{d}{n}\Big)^{1 - q/2} + |||\hat{\Theta} - \Theta^*|||_F^2 \Big\}. \tag{43}$$

Although quantitative aspects of the rates are different, Corollary 4 is analogous to Corollary 2.
For the case of exactly low rank matrices (part (a)), geometric convergence is guaranteed up to
a tolerance involving the statistical error $|||\hat{\Theta} - \Theta^*|||_F^2$. For the case of approximately low rank
matrices (part (b)), the tolerance term involves an additional factor of $R_q \big(\frac{d}{n}\big)^{1 - q/2}$. Again, from
known results on minimax rates for matrix estimation [38], this term is known to be of comparable
or lower order than the quantity $|||\hat{\Theta} - \Theta^*|||_F^2$. As before, it is also possible to derive an analogous
corollary of Theorem 2 for estimating low-rank matrices; in the interests of space, we leave such a
development to the reader.



3.3.2 Bounds for matrix completion 

In this model, observation $y_i$ is a noisy version of a randomly selected entry $\Theta^*_{a(i), b(i)}$ of the unknown
matrix $\Theta^*$. Applications of this matrix completion problem include collaborative filtering [40],
where the rows of the matrix $\Theta^*$ correspond to users, and the columns correspond to items (e.g.,
movies in the Netflix database), and the entry $\Theta^*_{ab}$ corresponds to user $a$'s rating of item $b$. Given
observations of only a subset of the entries of $\Theta^*$, the goal is to fill in, or complete, the matrix,
thereby making recommendations of movies that a given user has not yet seen.

Matrix completion can be viewed as a particular case of the matrix regression model (18),
in particular by setting $X_i = E_{a(i) b(i)}$, corresponding to the matrix with a single one in position
$(a(i), b(i))$, and zeroes in all other positions. Note that these observation matrices are extremely
sparse, in contrast to the compressed sensing model. Nuclear-norm based estimators for matrix
completion are known to have good statistical properties (e.g., [HI [Ml SQl [30] ) . Here we consider 
the M-estimator 

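$$\hat{\Theta} \in \arg\min_{\Theta \in \Omega} \frac{1}{2n} \sum_{i=1}^{n} \big(y_i - \Theta_{a(i) b(i)}\big)^2 \quad \text{such that } |||\Theta|||_1 \le \rho, \tag{44}$$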

where = {0 G M"^^*^ | ||0||oo < is the set of matrices with bounded elementwise £oo norm. 
This constraint eliminates matrices that are overly "spiky" (i.e., concentrate too much of their mass
in a single position); as discussed in the paper [30], such spikiness control is necessary in order to
bound the non-identifiable component of the matrix completion model.

Corollary 5 (Matrix completion). Under the conditions of Theorem 1, suppose that $\Theta^* \in \mathbb{B}_q(R_q)$,
and that we solve the program (44) with $\rho \le |||\Theta^*|||_1$. As long as $n > c_0 R_q^{1/(1 - q/2)}\, d \log d$ for a
sufficiently large constant $c_0$, then with probability at least $1 - \exp(-c_1 d \log d)$, there is a contraction
coefficient $\kappa_t \in (0, 1)$ that decreases with $t$ such that for all iterations $t = 0, 1, 2, \ldots$,

|||e*+i -e\il<4 1116° -e\il + c2 + ig _ e*||||}. (45) 

In some cases, the bound on $\|\Theta\|_\infty$ in the algorithm (44) might be unknown, or undesirable.
While this constraint is necessary in general [30], it can be avoided if more information such as
the sampling distribution (that is, the distribution of $X_i$) is known and used to construct the
estimator. In this case, Koltchinskii et al. [22] show error bounds on a nuclear norm penalized
estimator without requiring an $\ell_\infty$ bound on $\hat{\Theta}$.

Again a similar corollary of Theorem [2] can be derived by combining the proof of Corollary [5] 
with that of Theorem [21 An interesting aspect of this problem is that the condition ISnTb) takes 

the form A„ > f^H^^^^Z^^ where a is a bound on ||0||oo- This condition is independent of p, and 
hence, given a sample size as stated in the corollary, the algorithm always converges geometrically 
for any radius p > ||| Q* ||| i . 
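
As a concrete illustration of the matrix completion observation model in this subsection, the sketch below checks that with $X_i = E_{a(i) b(i)}$ the trace inner product simply reads off one entry, so the gradient of the least-squares loss can be accumulated directly on the observed positions without forming any $X_i$. The $1/(2n)$ scaling of the loss and the function names are assumptions made for the example.

```python
import numpy as np

def observation_matrix(a, b, d):
    """E_{ab}: the d x d matrix with a single one in position (a, b)."""
    E = np.zeros((d, d))
    E[a, b] = 1.0
    return E

def completion_gradient(Theta, rows, cols, y):
    """Gradient of (1/(2n)) * sum_i (y_i - Theta[rows[i], cols[i]])^2.

    Uses <<E_{ab}, Theta>> = Theta[a, b], so only observed entries are touched.
    """
    n = y.size
    grad = np.zeros_like(Theta)
    residual = Theta[rows, cols] - y
    # np.add.at accumulates correctly when the same entry is observed twice
    np.add.at(grad, (rows, cols), residual / n)
    return grad
```

The gradient has at most $n$ non-zero entries, which reflects the extreme sparsity of the observation matrices noted above.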



3.4 Matrix decomposition problems 

In recent years, various researchers have studied methods for solving the problem of matrix de- 
composition (e.g., \12\ [TOl [U |T9]). The basic problem has the following form: given a pair of 
unknown matrices $\Theta^*$ and $\Gamma^*$, both lying in $\mathbb{R}^{d_1 \times d_2}$, suppose that we observe a third matrix
specified by the model $Y = \Theta^* + \Gamma^* + W$, where $W \in \mathbb{R}^{d_1 \times d_2}$ represents observation noise. Typically the
matrix $\Theta^*$ is assumed to be low-rank, and some low-dimensional structural constraint is assumed
on the matrix $\Gamma^*$. For example, the papers [12, 10, 19] consider the setting in which $\Gamma^*$ is sparse,
while Xu et al. [44] consider a column-sparse model, in which only a few of the columns of $\Gamma^*$ have
non-zero entries. In order to illustrate the application of our general result to this setting, here
we consider the low-rank plus column-sparse framework [33]. (We note that since the $\ell_1$-norm is
decomposable, similar results can easily be derived for the low-rank plus entrywise-sparse setting
as well.)

Since $\Theta^*$ is assumed to be low-rank, as before we use the nuclear norm $|||\Theta|||_1$ as a regularizer
(see Section 2.4.2). We assume that the unknown matrix $\Gamma^* \in \mathbb{R}^{d_1 \times d_2}$ is column-sparse, say with
at most $s \le d_2$ non-zero columns. A suitable convex regularizer for this matrix structure is based
on the columnwise $(1,2)$-norm, given by

$$\|\Gamma\|_{1,2} := \sum_{j=1}^{d_2} \|\Gamma_j\|_2, \qquad (46)$$

where $\Gamma_j \in \mathbb{R}^{d_1}$ denotes the $j$th column of $\Gamma$. Note also that the dual norm is given by the
elementwise $(\infty,2)$-norm $\|\Gamma\|_{\infty,2} = \max_{j=1,\ldots,d_2} \|\Gamma_j\|_2$, corresponding to the maximum $\ell_2$-norm
over columns.
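
To make the two norms concrete, the following minimal sketch computes the columnwise $(1,2)$-norm of equation (46) and its dual $(\infty,2)$-norm for a small matrix stored as a list of rows. The helper names and the example matrix are illustrative assumptions, not part of the original experiments.

```python
import math

def column_norms(matrix):
    """Return the l2-norm of each column of a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]

def norm_1_2(matrix):
    """Columnwise (1,2)-norm: the sum of the column l2-norms, as in equation (46)."""
    return sum(column_norms(matrix))

def norm_inf_2(matrix):
    """Dual (infinity,2)-norm: the largest column l2-norm."""
    return max(column_norms(matrix))

# Illustrative 2 x 3 matrix with one zero column (values are made up).
gamma = [[3.0, 0.0, 1.0],
         [4.0, 0.0, 0.0]]
print(norm_1_2(gamma))    # column norms are 5, 0 and 1
print(norm_inf_2(gamma))  # largest column norm
```

A column-sparse matrix has a small $(1,2)$-norm relative to a dense matrix with the same Frobenius norm, which is why this norm is a natural convex surrogate for column sparsity.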

In order to estimate the unknown pair $(\Theta^*, \Gamma^*)$, we consider the M-estimator

$$(\widehat{\Theta}, \widehat{\Gamma}) := \arg\min_{\Theta, \Gamma} |||Y - \Theta - \Gamma|||_F^2 \quad \text{such that } |||\Theta|||_1 \le \rho_\Theta,\ \|\Gamma\|_{1,2} \le \rho_\Gamma \text{ and } \|\Theta\|_{\infty,2} \le \frac{\alpha}{\sqrt{d_2}}. \qquad (47)$$

The first two constraints restrict $\Theta$ and $\Gamma$ to a nuclear norm ball of radius $\rho_\Theta$ and a $(1,2)$-norm
ball of radius $\rho_\Gamma$, respectively. The final constraint controls the "spikiness" of the low-rank
component $\Theta$, as measured in the $(\infty,2)$-norm, corresponding to the maximum $\ell_2$-norm over the
columns. As with the elementwise $\ell_\infty$-bound for matrix completion, this additional constraint is
required in order to limit the non-identifiability in matrix decomposition. (See the paper [1] for
more discussion of non-identifiability issues in matrix decomposition.)

With this set-up, consider the projected gradient algorithm when applied to the matrix
decomposition problem: it generates a sequence of matrix pairs $(\Theta^t, \Gamma^t)$ for $t = 0, 1, 2, \ldots$, and the
optimization error is characterized in terms of the matrices $\Delta_\Theta^t := \Theta^t - \widehat{\Theta}$ and $\Delta_\Gamma^t := \Gamma^t - \widehat{\Gamma}$.
Finally, we measure the optimization error at time $t$ in terms of the squared Frobenius error
$e^2(\Delta_\Theta^t, \Delta_\Gamma^t) := |||\Delta_\Theta^t|||_F^2 + |||\Delta_\Gamma^t|||_F^2$, summed across both the low-rank and column-sparse components.

Corollary 6 (Matrix decomposition). Under the conditions of Theorem 1, suppose that $\|\Theta^*\|_{\infty,2} \le \alpha/\sqrt{d_2}$
and $\Gamma^*$ has at most $s$ non-zero columns. If we solve the convex program (47) with $\rho_\Theta \le |||\Theta^*|||_1$ and
$\rho_\Gamma \le \|\Gamma^*\|_{1,2}$, then for all iterations $t = 0, 1, 2, \ldots$,

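$$e^2(\Delta_\Theta^t, \Delta_\Gamma^t) \le \left(\tfrac{3}{4}\right)^t e^2(\Delta_\Theta^0, \Delta_\Gamma^0) + c\left(|||\widehat{\Gamma} - \Gamma^*|||_F^2 + \alpha^2 \frac{s}{d_2}\right).$$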

This corollary has some unusual aspects, relative to the previous corollaries. First of all, in
contrast to the previous results, the guarantee is a deterministic one (as opposed to holding with
high probability). More specifically, the RSC/RSM conditions hold in a deterministic sense, which
should be contrasted with the high probability statements given in Corollaries 2-5. Consequently,
the effective conditioning of the problem does not depend on sample size and we are guaranteed
geometric convergence at a fixed rate, independent of sample size. The additional tolerance term
is completely independent of the rank of $\Theta^*$ and only depends on the column-sparsity of $\Gamma^*$.

4 Simulation results 

In this section, we provide some experimental results that confirm the accuracy of our theoretical
results, in particular showing excellent agreement with the linear rates predicted by our theory. In
addition, the rates of convergence slow down for smaller sample sizes, which lead to problems with
relatively poor conditioning. In all the simulations reported below, we plot the log error $\|\theta^t - \widehat{\theta}\|$
between the iterate $\theta^t$ at time $t$ and the final solution $\widehat{\theta}$. Each curve provides the results averaged
over five random trials, according to the ensembles which we now describe.

4.1 Sparse regression 

We begin by considering the linear regression model $y = X\theta^* + w$, where $\theta^*$ is the unknown regression
vector belonging to the set $\mathbb{B}_q(R_q)$, and the observation noise is i.i.d. with $w_i \sim N(0, 0.25)$. We consider a
family of ensembles for the random design matrix $X \in \mathbb{R}^{n \times d}$. In particular, we construct $X$ by
generating each row $x_i \in \mathbb{R}^d$ independently according to the following procedure. Let $z_1, \ldots, z_d$ be
an i.i.d. sequence of $N(0,1)$ variables, and fix some correlation parameter $\omega \in [0,1)$. We first
initialize by setting $x_{i,1} = z_1/\sqrt{1-\omega^2}$, and then generate the remaining entries by applying the
recursive update $x_{i,t+1} = \omega x_{i,t} + z_{t+1}$ for $t = 1, 2, \ldots, d-1$, so that $x_i \in \mathbb{R}^d$ is a zero-mean Gaussian
random vector. It can be verified that all the eigenvalues of $\Sigma = \mathrm{cov}(x_i)$ lie within the interval
$\left[\frac{1}{(1+\omega)^2}, \frac{2}{(1-\omega)^2(1+\omega)}\right]$, so that $\Sigma$ has a finite condition number for all $\omega \in [0,1)$. At one extreme,
for $\omega = 0$, the matrix $\Sigma$ is the identity, and so has condition number equal to 1. As $\omega \to 1$, the
matrix $\Sigma$ becomes progressively more ill-conditioned, with a condition number that is very large
for $\omega$ close to one. As a consequence, although incoherence conditions like the restricted isometry
property can be satisfied when $\omega = 0$, they will fail to be satisfied (w.h.p.) once $\omega$ is large enough.
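
The row-generation procedure can be sketched as follows. This is a minimal illustration using only the Python standard library; it assumes that a fresh standard normal innovation is drawn for every coordinate, and the function names, seed and sizes are hypothetical choices rather than the settings used for the reported experiments.

```python
import math
import random

def sample_design_row(d, omega, rng):
    """Draw one row x_i in R^d from the autoregressive recursion described above."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]      # i.i.d. N(0, 1) innovations
    row = [z[0] / math.sqrt(1.0 - omega ** 2)]       # first entry, scaled innovation
    for t in range(1, d):
        row.append(omega * row[-1] + z[t])           # recursive update
    return row

def sample_design(n, d, omega, seed=0):
    """Stack n independent rows into a design matrix (a list of rows)."""
    rng = random.Random(seed)
    return [sample_design_row(d, omega, rng) for _ in range(n)]

# Small illustrative instance; omega = 0 gives independent N(0, 1) entries.
X = sample_design(n=200, d=5, omega=0.5)
```

Larger values of `omega` make neighbouring coordinates of each row more strongly correlated, which is the mechanism by which the covariance becomes ill-conditioned.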

For this random ensemble of problems, we have investigated convergence rates for a wide range
of dimensions $d$ and radii $R_q$. Since the results are relatively uniform across the choice of these
parameters, here we report results for dimension $d = 20{,}000$, and radius $R_q = \lceil (\log d)^2 \rceil$. In the
case $q = 0$, the radius $R_0 = s$ corresponds to the sparsity level. The per iteration cost in this case
is $O(nd)$. In order to reveal dependence of convergence rates on sample size, we study a range of
the form $n = \lceil \alpha s \log d \rceil$, where the order parameter $\alpha > 0$ is varied.
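
As a small worked example of this scaling, the sketch below evaluates the sample size for several values of the order parameter. It assumes the natural logarithm, and the sparsity level `s = 10` is a made-up illustrative value rather than the one used in the experiments.

```python
import math

def sample_size(alpha, s, d):
    """n = ceil(alpha * s * log d), using the natural logarithm (an assumption)."""
    return math.ceil(alpha * s * math.log(d))

d = 20000
s = 10  # illustrative sparsity level, not taken from the experiments
for alpha in (1, 1.25, 5, 25):
    print(alpha, sample_size(alpha, s, d))
```

Even for the largest order parameter the resulting $n$ stays well below $d$, so every configuration remains in the high-dimensional regime.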

Our first experiment is based on taking the correlation parameter $\omega = 0$, and the $\ell_q$-ball
parameter $q = 0$, corresponding to exact sparsity. We then measure convergence rates for sample
sizes specified by $\alpha \in \{1, 1.25, 5, 25\}$. As shown by the results plotted in panel (a) of Figure 3,
projected gradient descent fails to converge for $\alpha = 1$ or $\alpha = 1.25$; in both these cases, the sample
size $n$ is too small for the RSC and RSM conditions to hold, so that a constant step size leads
to oscillatory behavior in the algorithm. In contrast, once the order parameter $\alpha$ becomes large
enough to ensure that the RSC/RSM conditions hold (w.h.p.), we observe a geometric convergence
of the error $\|\theta^t - \widehat{\theta}\|_2$. Moreover the convergence rate is faster for $\alpha = 25$ compared to $\alpha = 5$, since
the RSC/RSM constants are better with larger sample size. Such behavior is in agreement with
the conclusions of Corollary 2, which predicts that the convergence rate should improve as the
number of samples $n$ is increased.
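
For readers who want to experiment, here is a minimal pure-Python sketch of projected gradient descent for least squares over an $\ell_1$-ball. It is an illustration under stated assumptions, not the code behind the figures: the projection uses the standard sort-based method, the step size is $1/\gamma_u$ with $\gamma_u$ set to the trace of $X^T X/n$ (a deliberately conservative upper bound on the largest eigenvalue), and the problem sizes are tiny.

```python
import random

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1-ball {x : ||x||_1 <= radius}."""
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    magnitudes = sorted((abs(x) for x in v), reverse=True)
    running_sum, threshold = 0.0, 0.0
    for k, m in enumerate(magnitudes, start=1):
        running_sum += m
        candidate = (running_sum - radius) / k
        if m > candidate:          # the k-th largest entry stays active after shrinkage
            threshold = candidate
    return [(1.0 if x > 0 else -1.0) * max(abs(x) - threshold, 0.0) for x in v]

def loss(X, y, theta):
    """Least-squares loss (1/2n) ||y - X theta||_2^2."""
    n = len(X)
    return sum((sum(a * b for a, b in zip(row, theta)) - yi) ** 2
               for row, yi in zip(X, y)) / (2.0 * n)

def gradient(X, y, theta):
    """Gradient of the least-squares loss, X^T (X theta - y) / n."""
    n = len(X)
    residual = [sum(a * b for a, b in zip(row, theta)) - yi for row, yi in zip(X, y)]
    return [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(len(theta))]

def projected_gradient_descent(X, y, radius, num_steps):
    n, d = len(X), len(X[0])
    # trace(X^T X / n) upper-bounds the largest eigenvalue, so 1/gamma_u is a safe step.
    gamma_u = sum(v * v for row in X for v in row) / n
    theta = [0.0] * d
    losses = [loss(X, y, theta)]
    for _ in range(num_steps):
        g = gradient(X, y, theta)
        theta = project_l1_ball([t - gj / gamma_u for t, gj in zip(theta, g)], radius)
        losses.append(loss(X, y, theta))
    return theta, losses

# Tiny synthetic instance (sizes chosen for speed, not taken from the experiments).
rng = random.Random(0)
n, d = 60, 15
theta_star = [1.0, -2.0, 0.5] + [0.0] * (d - 3)
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [sum(a * b for a, b in zip(row, theta_star)) + rng.gauss(0.0, 0.5) for row in X]
radius = sum(abs(t) for t in theta_star)
theta_hat, losses = projected_gradient_descent(X, y, radius, num_steps=200)
```

Because the trace bound is conservative, this sketch decreases the loss monotonically at every sample size; the oscillation reported for small $\alpha$ arises with a fixed step size that is not safeguarded in this way.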

Figure 3. Plot of the log of the optimization error $\log(\|\theta^t - \widehat{\theta}\|_2)$ in the sparse linear regression
problem, rescaled so the plots start at 0. In this problem, $d = 20000$, $s = \lceil \log d \rceil$, $n = \alpha s \log d$. Plot
(a) shows convergence for the exact sparse case with $q = 0$ and $\Sigma = I$ (i.e. $\omega = 0$). In panel (b), we
observe how convergence rates change as the correlation parameter $\omega$ is varied for $q = 0$ and $\alpha = 25$.
Plot (c) shows the convergence rates when $\omega = 0$, $\alpha = 25$ and $q$ is varied.



On the other hand, Corollary 2 also predicts that convergence rates should be slower when the
condition number of $\Sigma$ is worse. In order to test this prediction, we again studied an exactly sparse
problem ($q = 0$), this time with the fixed sample size $n = \lceil 25 s \log d \rceil$, and we varied the correlation
parameter $\omega \in \{0, 0.5, 0.8\}$. As shown in panel (b) of Figure 3, the convergence rates slow down
as the correlation parameter is increased, and for the case of extremely high correlation of $\omega = 0.8$,
the optimization error curve is almost flat: the method makes very slow progress in this case.

A third prediction of Corollary 2 is that the convergence of projected gradient descent should
become slower as the sparsity parameter $q$ is varied between exact sparsity ($q = 0$), and the
least sparse case ($q = 1$). (In particular, note for $n > \log d$, the quantity $\chi_n$ from equation (35) is
monotonically increasing with $q$.) Panel (c) of Figure 3 shows convergence rates for the fixed sample
size $n = 25 s \log d$ and correlation parameter $\omega = 0$, and with the sparsity parameter $q \in \{0, 0.5, 1.0\}$.
As expected, the convergence rate slows down as $q$ increases from 0 to 1. Corollary 2 further captures
how the contraction factor changes as the problem parameters $(s, d, n)$ are varied. In particular,
it predicts that as we change the triplet simultaneously, while holding the ratio $\alpha = s \log d / n$
constant, the convergence rate should stay the same. We recall that this phenomenon was indeed
demonstrated in Figure 1 in Section 1.

4.2 Low-rank matrix estimation 

We also performed experiments with two different versions of low-rank matrix regression. Our
simulations applied to instances of the observation model $y_i = \langle\langle X_i, \Theta^* \rangle\rangle + w_i$, for $i = 1, 2, \ldots, n$,
where $\Theta^* \in \mathbb{R}^{200 \times 200}$ is a fixed unknown matrix, $X_i \in \mathbb{R}^{200 \times 200}$ is a matrix of covariates, and
$w_i \sim N(0, 0.25)$ is observation noise. In analogy to the sparse vector problem, we performed
simulations with the matrix $\Theta^*$ belonging to the set $\mathbb{B}_q(R_q)$ of approximately low-rank matrices, as
previously defined in equation (41) for $q \in [0,1]$. The case $q = 0$ corresponds to the set of matrices
with rank at most $r = R_0$, whereas the case $q = 1$ corresponds to the ball of matrices with nuclear
norm at most $R_1$.

Figure 4. (a) Plot of log Frobenius error $\log(|||\Theta^t - \widehat{\Theta}|||_F)$ versus number of iterations in matrix
compressed sensing for a matrix size $d = 200$ with rank $R_0 = 5$, and sample sizes $n = \alpha R_0 d$. For
$\alpha \in \{1, 1.25\}$, the algorithm oscillates, whereas geometric convergence is obtained for $\alpha \in \{5, 25\}$,
consistent with the theoretical prediction. (b) Plot of log Frobenius error $\log(|||\Theta^t - \widehat{\Theta}|||_F)$ versus
number of iterations in matrix completion with $d = 200$, $R_0 = 5$, and $n = \alpha R_0 d \log(d)$ with
$\alpha \in \{1, 2, 5, 25\}$. For $\alpha \in \{2, 5, 25\}$ the algorithm enjoys geometric convergence.

In our first set of matrix experiments, we considered the matrix version of compressed
sensing [36], in which each matrix $X_i \in \mathbb{R}^{200 \times 200}$ is randomly formed with i.i.d. $N(0,1)$ entries, as
described in Section 3.3.1. In the case $q = 0$, we formed a matrix $\Theta^* \in \mathbb{R}^{200 \times 200}$ with rank $R_0 = 5$,
and performed simulations over the sample sizes $n = \alpha R_0 d$, with the parameter $\alpha \in \{1, 1.25, 5, 25\}$.
The per iteration cost in this case is $O(nd^2)$. As seen in panel (a) of Figure 4, the projected
gradient descent method exhibits behavior that is qualitatively similar to that for the sparse linear
regression problem. More specifically, it fails to converge when the sample size (as reflected by the
order parameter $\alpha$) is too small, and converges geometrically with a progressively faster rate as $\alpha$
is increased. We have also observed similar types of scaling as the matrix sparsity parameter is
increased from $q = 0$ to $q = 1$.

In our second set of matrix experiments, we studied the behavior of projected gradient
descent for the problem of matrix completion, as described in Section 3.3.2. For this problem, we
again studied matrices of dimension $d = 200$ and rank $R_0 = 5$, and we varied the sample size as
$n = \alpha R_0 d \log d$ for $\alpha \in \{1, 2, 5, 25\}$. As shown in panel (b) of Figure 4, projected gradient descent
for matrix completion also enjoys geometric convergence for $\alpha$ large enough.

5 Proofs 

In this section, we provide the proofs of our results. Recall that we use $\Delta^t := \theta^t - \widehat{\theta}$ to denote
the optimization error, and $\Delta^* = \widehat{\theta} - \theta^*$ to denote the statistical error. For future reference, we
point out a slight weakening of restricted strong convexity (RSC), useful for obtaining parts of our
results. As the proofs to follow reveal, it is only necessary to enforce an RSC condition of the form

$$\mathcal{T}_{\mathcal{L}}(\theta^t; \widehat{\theta}) \ge \frac{\gamma_\ell}{2}\|\theta^t - \widehat{\theta}\|^2 - \tau_\ell(\mathcal{L}_n)\,\mathcal{R}^2(\theta^t - \widehat{\theta}) - \delta^2, \qquad (48)$$

which is milder than the original RSC condition (8), in that it applies only to differences of the
form $\theta^t - \widehat{\theta}$, and allows for additional slack $\delta$. We make use of this refined notion in the proofs of
various results to follow.

With this relaxed RSC condition and the same RSM condition as before, our proof shows that

$$\|\theta^t - \widehat{\theta}\|^2 \le \kappa^t \|\theta^0 - \widehat{\theta}\|^2 + \frac{\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}}) + 2\delta^2/\gamma_u}{1 - \kappa} \quad \text{for all iterations } t = 0, 1, 2, \ldots. \qquad (49)$$

Note that this result reduces to the previous statement when $\delta = 0$. This extension of Theorem 1
is used in the proofs of Corollaries 5 and 6.

We will assume without loss of generality that all the iterates lie in the subset $\Omega'$ of $\Omega$. This can
be ensured by augmenting the loss with the indicator of $\Omega'$ or equivalently performing projections
on the set $\Omega' \cap \mathbb{B}_{\mathcal{R}}(\rho)$ as mentioned earlier.

5.1 Proof of Theorem [1] 

Recall that Theorem 1 concerns the constrained problem (1). The proof is based on two technical
lemmas. The first lemma guarantees that at each iteration $t = 0, 1, 2, \ldots$, the optimization error
$\Delta^t = \theta^t - \widehat{\theta}$ belongs to an interesting constraint set defined by the regularizer.

Lemma 1. Let $\widehat{\theta}$ be any optimum of the constrained problem (1) for which $\mathcal{R}(\widehat{\theta}) = \rho$. Then for
any iteration $t = 1, 2, \ldots$ and for any $\mathcal{R}$-decomposable subspace pair $(\mathcal{M}, \overline{\mathcal{M}}^\perp)$, the optimization
error $\Delta^t := \theta^t - \widehat{\theta}$ belongs to the set

$$\mathbb{S}(\mathcal{M}; \overline{\mathcal{M}}; \theta^*) := \big\{\Delta \in \Omega \;\big|\; \mathcal{R}(\Delta) \le 2\Psi(\overline{\mathcal{M}})\|\Delta\| + 2\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + 2\mathcal{R}(\Delta^*) + \Psi(\overline{\mathcal{M}})\|\Delta^*\|\big\}. \qquad (50)$$

The proof of this lemma, provided in Appendix A.1, exploits the decomposability of the regularizer
in an essential way.

The structure of the set (50) takes a simpler form in the special case when $\mathcal{M}$ is chosen to
contain $\theta^*$ and $\overline{\mathcal{M}} = \mathcal{M}$. In this case, we have $\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) = 0$, and hence the optimization error
$\Delta^t$ satisfies the inequality

$$\mathcal{R}(\Delta^t) \le 2\Psi(\mathcal{M})\{\|\Delta^t\| + \|\Delta^*\|\} + 2\mathcal{R}(\Delta^*). \qquad (51)$$

An inequality of this type, when combined with the definitions of RSC/RSM, allows us to establish
the curvature conditions required to prove globally geometric rates of convergence.

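We now state a second lemma under the more general RSC condition (48):

Lemma 2. Under the RSC condition (48) and RSM condition (10), for all $t = 0, 1, 2, \ldots$, we have

$$\gamma_u \langle \theta^t - \theta^{t+1},\, \theta^t - \widehat{\theta} \rangle \ge \Big\{\frac{\gamma_u}{2}\|\theta^t - \theta^{t+1}\|^2 - \tau_u(\mathcal{L}_n)\,\mathcal{R}^2(\theta^{t+1} - \theta^t)\Big\} + \Big\{\frac{\gamma_\ell}{2}\|\theta^t - \widehat{\theta}\|^2 - \tau_\ell(\mathcal{L}_n)\,\mathcal{R}^2(\theta^t - \widehat{\theta}) - \delta^2\Big\}. \qquad (52)$$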

The proof of this lemma, provided in Appendix \A.2\ follows along the lines of the intermediate 
result within Theorem 2.2.8 of Nesterov pT|, but with some care required to handle the additional 
terms that arise in our weakened forms of strong convexity and smoothness. 



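Using these auxiliary results, let us now complete the proof of Theorem 1. We first note
the elementary relation

$$\|\theta^{t+1} - \widehat{\theta}\|^2 = \|\theta^t - \widehat{\theta} - \theta^t + \theta^{t+1}\|^2 = \|\theta^t - \widehat{\theta}\|^2 + \|\theta^t - \theta^{t+1}\|^2 - 2\langle \theta^t - \widehat{\theta},\, \theta^t - \theta^{t+1} \rangle. \qquad (53)$$

We now use Lemma 2 and the more general form of RSC (48) to control the cross-term, thereby
obtaining the upper bound

$$\|\theta^{t+1} - \widehat{\theta}\|^2 \le \|\theta^t - \widehat{\theta}\|^2 - \frac{\gamma_\ell}{\gamma_u}\|\theta^t - \widehat{\theta}\|^2 + \frac{2\tau_u(\mathcal{L}_n)}{\gamma_u}\mathcal{R}^2(\theta^{t+1} - \theta^t) + \frac{2\tau_\ell(\mathcal{L}_n)}{\gamma_u}\mathcal{R}^2(\theta^t - \widehat{\theta}) + \frac{2\delta^2}{\gamma_u}$$

$$= \Big(1 - \frac{\gamma_\ell}{\gamma_u}\Big)\|\theta^t - \widehat{\theta}\|^2 + \frac{2\tau_u(\mathcal{L}_n)}{\gamma_u}\mathcal{R}^2(\theta^{t+1} - \theta^t) + \frac{2\tau_\ell(\mathcal{L}_n)}{\gamma_u}\mathcal{R}^2(\theta^t - \widehat{\theta}) + \frac{2\delta^2}{\gamma_u}.$$

We now observe that by the triangle inequality and the Cauchy-Schwarz inequality,

$$\mathcal{R}^2(\theta^{t+1} - \theta^t) \le \big(\mathcal{R}(\theta^{t+1} - \widehat{\theta}) + \mathcal{R}(\widehat{\theta} - \theta^t)\big)^2 \le 2\mathcal{R}^2(\theta^{t+1} - \widehat{\theta}) + 2\mathcal{R}^2(\theta^t - \widehat{\theta}).$$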
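
The two elementary facts used in this step, the expansion (53) and the bound $(a+b)^2 \le 2a^2 + 2b^2$ applied to the regularizer, can be checked numerically. The sketch below does so on random vectors; the vectors are arbitrary and the $\ell_1$-norm stands in for the regularizer, both of which are illustrative assumptions.

```python
import random

def sq_norm(v):
    return sum(x * x for x in v)

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def l1(v):
    """The l1-norm plays the role of the regularizer in this check."""
    return sum(abs(x) for x in v)

rng = random.Random(1)
theta_t, theta_next, theta_hat = ([rng.gauss(0, 1) for _ in range(4)] for _ in range(3))

# Both sides of the elementary relation (53).
lhs = sq_norm(sub(theta_next, theta_hat))
rhs = (sq_norm(sub(theta_t, theta_hat)) + sq_norm(sub(theta_t, theta_next))
       - 2.0 * inner(sub(theta_t, theta_hat), sub(theta_t, theta_next)))

# Triangle inequality followed by (a + b)^2 <= 2 a^2 + 2 b^2.
r_step = l1(sub(theta_next, theta_t)) ** 2
r_bound = 2 * l1(sub(theta_next, theta_hat)) ** 2 + 2 * l1(sub(theta_t, theta_hat)) ** 2
print(abs(lhs - rhs), r_step <= r_bound)
```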
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Recall the definition of the optimization error A* := 9^ — 9, we have the upper bound 

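$$\|\Delta^{t+1}\|^2 \le \Big(1 - \frac{\gamma_\ell}{\gamma_u}\Big)\|\Delta^t\|^2 + \frac{4\tau_u(\mathcal{L}_n)}{\gamma_u}\mathcal{R}^2(\Delta^{t+1}) + \frac{4\tau_u(\mathcal{L}_n) + 2\tau_\ell(\mathcal{L}_n)}{\gamma_u}\mathcal{R}^2(\Delta^t) + \frac{2\delta^2}{\gamma_u}. \qquad (54)$$

We now apply Lemma 1 to control the terms involving $\mathcal{R}^2$. In terms of squared quantities, the
inequality (50) implies that

$$\mathcal{R}^2(\Delta^t) \le 4\Psi^2(\overline{\mathcal{M}})\|\Delta^t\|^2 + 2\nu^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}}) \quad \text{for all } t = 0, 1, 2, \ldots,$$

where we recall that $\Psi^2(\overline{\mathcal{M}})$ is the subspace compatibility (12) and $\nu^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}})$ accumulates
all the residual terms. Applying this bound twice, once for $t$ and once for $t + 1$, and substituting
into equation (54) yields that $\big\{1 - \frac{16\Psi^2(\overline{\mathcal{M}})\tau_u(\mathcal{L}_n)}{\gamma_u}\big\}\|\Delta^{t+1}\|^2$ is upper bounded by

$$\Big\{1 - \frac{\gamma_\ell}{\gamma_u} + \frac{16\Psi^2(\overline{\mathcal{M}})(\tau_u(\mathcal{L}_n) + \tau_\ell(\mathcal{L}_n))}{\gamma_u}\Big\}\|\Delta^t\|^2 + \frac{16(\tau_u(\mathcal{L}_n) + \tau_\ell(\mathcal{L}_n))\,\nu^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}})}{\gamma_u} + \frac{2\delta^2}{\gamma_u}.$$

Under the assumptions of Theorem 1, we are guaranteed that $\frac{16\Psi^2(\overline{\mathcal{M}})\tau_u(\mathcal{L}_n)}{\gamma_u} < 1/2$, and so we can
re-arrange this inequality into the form

$$\|\Delta^{t+1}\|^2 \le \kappa\|\Delta^t\|^2 + \epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}}) + \frac{2\delta^2}{\gamma_u}, \qquad (55)$$

where $\kappa$ and $\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}})$ were previously defined in equations (22) and (23) respectively. Iterating
this recursion yields

$$\|\Delta^{t+1}\|^2 \le \kappa^{t+1}\|\Delta^0\|^2 + \Big(\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}}) + \frac{2\delta^2}{\gamma_u}\Big)\sum_{j=0}^{t}\kappa^j.$$

The assumptions of Theorem 1 guarantee that $\kappa \in (0,1)$, so that summing the geometric series
yields the claim (24).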


5.2 Proof of Theorem [2] 

The Lagrangian version of the optimization program is based on solving the convex program (2),
with the objective function $\phi(\theta) = \mathcal{L}_n(\theta) + \lambda_n \mathcal{R}(\theta)$. Our proof is based on analyzing the error
$\phi(\theta^t) - \phi(\widehat{\theta})$ as measured in terms of this objective function. It requires two technical lemmas, both
of which are stated in terms of a given tolerance $\bar{\eta} > 0$, and an integer $T > 0$ such that

$$\phi(\theta^t) - \phi(\widehat{\theta}) \le \bar{\eta} \quad \text{for all } t \ge T. \qquad (56)$$

Our first technical lemma is analogous to Lemma 1, and restricts the optimization error $\Delta^t = \theta^t - \widehat{\theta}$
to a cone-like set.

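Lemma 3 (Iterated Cone Bound (ICB)). Let $\widehat{\theta}$ be any optimum of the regularized M-estimator (2).
Under condition (56) with parameters $(T, \bar{\eta})$, for any iteration $t \ge T$ and for any $\mathcal{R}$-decomposable
subspace pair $(\mathcal{M}, \overline{\mathcal{M}}^\perp)$, the optimization error $\Delta^t := \theta^t - \widehat{\theta}$ satisfies

$$\mathcal{R}(\Delta^t) \le 4\Psi(\overline{\mathcal{M}})\|\Delta^t\| + 8\Psi(\overline{\mathcal{M}})\|\Delta^*\| + 8\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + 2\min(\bar{\eta}/\lambda_n, \bar{\rho}). \qquad (57)$$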
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Our next lemma guarantees sufficient decrease of the objective value difference 4>{9^) — 4>{6). 
Lemma [3] plays a crucial role in its proof. Recall the definition (1270 of the compound contraction 
coefficient K{Cn].M), defined in terms of the related quantities ^(A^) and /3(A^). Throughout the 
proof, we drop the arguments of k, ^ and /3 so as to ease notation. 

Lemma 4. Under the RSC (48) and RSM conditions (10), as well as assumption (56) with
parameters $(\bar{\eta}, T)$, for all $t \ge T$, we have

I — K 

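where $\epsilon := 2\min(\bar{\eta}/\lambda_n, \bar{\rho})$ and $\bar{\epsilon}_{\mathrm{stat}} := 8\Psi(\overline{\mathcal{M}})\|\Delta^*\| + 8\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*))$.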

We are now in a position to prove our main theorem, in particular via a recursive application
of Lemma 4. At a high level, we divide the iterations $t = 0, 1, 2, \ldots$ into a series of disjoint epochs
$[T_k, T_{k+1})$ with $0 = T_0 < T_1 < T_2 < \cdots$. Moreover, we define an associated sequence of tolerances
$\bar{\eta}_0 > \bar{\eta}_1 > \cdots$ such that at the end of epoch $[T_{k-1}, T_k)$, the optimization error has been reduced to
$\bar{\eta}_k$. Our analysis guarantees that $\phi(\theta^t) - \phi(\widehat{\theta}) \le \bar{\eta}_k$ for all $t \ge T_k$, allowing us to apply Lemma 4
with smaller and smaller values of $\bar{\eta}$ until it reduces to the statistical error $\bar{\epsilon}_{\mathrm{stat}}$.

At the first iteration, we have no a priori bound on the error $\bar{\eta}_0 = \phi(\theta^0) - \phi(\widehat{\theta})$. However, since
Lemma 4 involves the quantity $\epsilon = \min(\bar{\eta}/\lambda_n, \bar{\rho})$, we may still apply it at the first epoch with
$\epsilon_0 = \bar{\rho}$ and $T_0 = 0$. In this way, we conclude that for all $t \ge 0$,

4>{e') - m < k\cP{6^) - m) + -r^mp' + elat)- 

L — K 

Now since the contraction coefficient $\kappa \in (0,1)$, for all iterations $t \ge T_1 := \big(\lceil \log(2\bar{\eta}_0/\bar{\eta}_1)/\log(1/\kappa) \rceil\big)_+$,
we are guaranteed that

m - m < ^(P^ + 4at) < ^ max(p-2, e-Lj. 

^ V ' 

This same argument can now be applied in a recursive manner. Suppose that for some $k \ge 1$,
we are given a pair $(\bar{\eta}_k, T_k)$ such that condition (56) holds. An application of Lemma 4 yields the
bound

0(0*) - m < k'-^hho^') - m) + T^i4 + 6-lj for all t > n. 

i K 

We now define := j^i^l + Cstat)- Once again, since k < 1 by assumption, we can choose 
$T_{k+1} := \lceil \log(2\bar{\eta}_k/\bar{\eta}_{k+1})/\log(1/\kappa) \rceil + T_k$, thereby ensuring that for all $t \ge T_{k+1}$, we have

± /s^ 

®It is for precisely this reason that our regularized M-estimator includes the additional side-constraint defined in 
terms of p. 
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In this way, we arrive at recursive inequalities involving the tolerances {??fc}fcLo time steps 
{Tfcl^Q— namely 

fjk+i < r. max(4, Estat); where £k = 2 min{%/An, p}, and (58a) 

^ . , , log(2^r?o/r?fc) 

Tk<k + — — ■ (58b 

log(l/K) 

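Now we claim that the recursion (58a) can be unwrapped so as to show that

$$\bar{\eta}_{k+1} \le \frac{\bar{\eta}_k}{4^{2^{k-1}}} \quad \text{and} \quad \frac{\bar{\eta}_{k+1}}{\lambda_n} \le \frac{\bar{\rho}}{4^{2^k}} \quad \text{for all } k = 1, 2, \ldots. \qquad (59)$$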



Taking these statements as given for the moment, let us now show how they can be used to upper
bound the smallest $k$ such that $\bar{\eta}_k \le \delta^2$. If we are in the first epoch, the claim of the theorem
is straightforward from equation (58a). If not, we first use the recursion (59) to upper bound the
number of epochs needed and then use the inequality (58b) to obtain the stated result on the total
number of iterations needed. Using the second inequality in the recursion (59), we see that it is
sufficient to ensure that $\frac{\bar{\rho}\lambda_n}{4^{2^{k-1}}} \le \delta^2$. Rearranging this inequality, we find that the error drops
below $\delta^2$ after at most

$$k_\delta \ge \log\Big(\log\Big(\frac{\bar{\rho}\lambda_n}{\delta^2}\Big)/\log(4)\Big)/\log(2) + 1 = \log_2\log_2\Big(\frac{\bar{\rho}\lambda_n}{\delta^2}\Big)$$

epochs. Combining the above bound on $k_\delta$ with the recursion (58b), we conclude that the inequality
$\phi(\theta^t) - \phi(\widehat{\theta}) \le \delta^2$ is guaranteed to hold for all iterations

$$t \ge k_\delta\Big(1 + \frac{\log 2}{\log(1/\kappa)}\Big) + \frac{\log(\bar{\eta}_0/\delta^2)}{\log(1/\kappa)},$$

which is the desired result.
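
The doubly logarithmic epoch count can be checked with a short calculation. In the sketch below, `x` stands for the ratio that appears inside the logarithm (assumed larger than 2), and the function names are hypothetical. The direct search finds the smallest $k \ge 1$ for which a tolerance shrinking like $x / 4^{2^{k-1}}$ drops below one, and the closed form is $\log_2 \log_2 x$.

```python
import math

def epochs_needed(x):
    """Smallest integer k >= 1 with x / 4**(2**(k - 1)) <= 1, found by direct search."""
    k = 1
    while x / 4 ** (2 ** (k - 1)) > 1:
        k += 1
    return k

def epoch_bound(x):
    """Closed form log2(log2(x)), valid for x > 2."""
    return math.log2(math.log2(x))

for x in (1e2, 1e6, 1e12, 1e24):
    print(x, epochs_needed(x), epoch_bound(x))
```

Squaring the ratio adds only one epoch, which is why the number of epochs is negligible next to the per-epoch iteration counts that scale with $\log(1/\kappa)^{-1}$.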

It remains to prove the recursion (59), which we do via induction on the index $k$. We begin with the base
case $k = 1$. Recalling the setting of $\bar{\eta}_1$ and our assumption on $\lambda_n$ in the theorem statement (30),
we are guaranteed that $\bar{\eta}_1/\lambda_n \le \bar{\rho}/4$, so that $\epsilon_1 \le \epsilon_0 = \bar{\rho}$. By applying equation (58a) with
$\epsilon_1 = 2\bar{\eta}_1/\lambda_n$ and assuming $\epsilon_1 \ge \bar{\epsilon}_{\mathrm{stat}}$, we obtain

(1 - k)A^ (1 - k)4A„ 4 

where step (i) uses the fact that ^"^41 and step (ii) uses the condition (pUp on A„. We have thus 
verified the first inequality (f59P for k = 1. Turning to the second inequality in the statement ([59]) . 
using equation [6OI we have 

m ^ m_ ^''^ P_ 
Xn - 4An - 16' 

where step (iii) follows from the assumption (I30p on A„. Turning to the inductive step, we again 
assume that 2f]k/Xn > Cstat and obtain from inequality (j58ap 

_ 32CP4 (») 32e/3%p (-) % 
< < < 



(1 - k)XI - (1 - ,^)A„42'=-^ - 42^- 
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Here step (iv) uses the second inequality of the inductive hypothesis ([59]) and step (v) is a conse- 
quence of the condition on A„ as before. The second part of the induction is similarly established, 
completing the proof. 

5.3 Proof of Corollary [1] 

In order to prove this claim, we must show that $\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}})$, as defined in equation (23), is
of order lower than $\mathbb{E}[\|\widehat{\theta} - \theta^*\|^2] = \mathbb{E}[\|\Delta^*\|^2]$. We make use of the following lemma, proved in
Appendix O 

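Lemma 5. If $\rho \le \mathcal{R}(\theta^*)$, then for any solution $\widehat{\theta}$ of the constrained problem (1) and any
$\mathcal{R}$-decomposable subspace pair $(\mathcal{M}, \overline{\mathcal{M}}^\perp)$, the statistical error $\Delta^* = \widehat{\theta} - \theta^*$ satisfies the inequality

$$\mathcal{R}(\Delta^*) \le 2\Psi(\overline{\mathcal{M}})\|\Delta^*\| + \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)). \qquad (61)$$

Using this lemma, we can complete the proof of Corollary 1. Recalling the form (23), under the
condition $\theta^* \in \mathcal{M}$, we have

$$\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}}) := \frac{32(\tau_u(\mathcal{L}_n) + \tau_\ell(\mathcal{L}_n))}{\gamma_u}\big(2\mathcal{R}(\Delta^*) + \Psi(\overline{\mathcal{M}})\|\Delta^*\|\big)^2.$$

Using the assumption $\frac{(\tau_u(\mathcal{L}_n) + \tau_\ell(\mathcal{L}_n))\Psi^2(\overline{\mathcal{M}})}{\gamma_u} = o(1)$, it suffices to show that $\mathcal{R}(\Delta^*) \le 2\Psi(\overline{\mathcal{M}})\|\Delta^*\|$.
Since Corollary 1 assumes that $\theta^* \in \mathcal{M}$ and hence that $\Pi_{\mathcal{M}^\perp}(\theta^*) = 0$, Lemma 5 implies that
$\mathcal{R}(\Delta^*) \le 2\Psi(\overline{\mathcal{M}})\|\Delta^*\|$, as required.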

5.4 Proofs of Corollaries [2] and [3] 

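The central challenge in proving this result is verifying that suitable forms of the RSC and RSM
conditions hold with sufficiently small parameters $\tau_\ell(\mathcal{L}_n)$ and $\tau_u(\mathcal{L}_n)$.

Lemma 6. Define the maximum variance $\zeta(\Sigma) := \max_{j=1,2,\ldots,d} \Sigma_{jj}$. Under the conditions of
Corollary 2, there are universal positive constants $(c_0, c_1)$ such that for all $\Delta \in \mathbb{R}^d$, we have

$$\frac{\|X\Delta\|_2^2}{n} \ge \frac{1}{2}\|\Sigma^{1/2}\Delta\|_2^2 - c_1\zeta(\Sigma)\frac{\log d}{n}\|\Delta\|_1^2, \quad \text{and} \qquad (62a)$$

$$\frac{\|X\Delta\|_2^2}{n} \le 2\|\Sigma^{1/2}\Delta\|_2^2 + c_1\zeta(\Sigma)\frac{\log d}{n}\|\Delta\|_1^2 \qquad (62b)$$

with probability at least $1 - \exp(-c_0 n)$.

Note that this lemma implies that the RSC and RSM conditions both hold with high probability,
in particular with parameters

$$\gamma_\ell = \frac{1}{2}\sigma_{\min}(\Sigma), \quad \text{and} \quad \tau_\ell(\mathcal{L}_n) = c_1\zeta(\Sigma)\frac{\log d}{n} \quad \text{for RSC, and}$$

$$\gamma_u = 2\sigma_{\max}(\Sigma) \quad \text{and} \quad \tau_u(\mathcal{L}_n) = c_1\zeta(\Sigma)\frac{\log d}{n} \quad \text{for RSM.}$$

This lemma has been proved by Raskutti et al. [34] for obtaining minimax rates in sparse linear
regression.

Let us first prove Corollary 2 in the special case of hard sparsity ($q = 0$), in which $\theta^*$ is supported
on a subset $S$ of cardinality $s$. Let us define the model subspace $\mathcal{M} := \{\theta \in \mathbb{R}^d \mid \theta_j = 0 \text{ for all } j \notin S\}$,
so that $\theta^* \in \mathcal{M}$. Recall from Section 2.4.1 that the $\ell_1$-norm is decomposable with respect to $\mathcal{M}$
and $\mathcal{M}^\perp$; as a consequence, we may also set $\overline{\mathcal{M}} = \mathcal{M}$ in the definitions (22) and (23). By
definition (12) of the subspace compatibility with the $\ell_1$-norm as the regularizer, and the $\ell_2$-norm
as the error norm, we have $\Psi^2(\mathcal{M}) = s$. Using the settings of $\tau_\ell(\mathcal{L}_n)$ and $\tau_u(\mathcal{L}_n)$ guaranteed by
Lemma 6 and substituting into equation (22), we obtain a contraction coefficient

$$\kappa(\Sigma) := \Big\{1 - \frac{\sigma_{\min}(\Sigma)}{4\sigma_{\max}(\Sigma)} + \chi_n(\Sigma)\Big\}\big\{1 - \chi_n(\Sigma)\big\}^{-1},$$

where $\chi_n(\Sigma) := \frac{c_2\,\zeta(\Sigma)}{\sigma_{\max}(\Sigma)}\frac{s\log d}{n}$ for some universal constant $c_2$. A similar calculation shows that the
tolerance term takes the form

$$\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}}) \le c_3\,\chi_n(\Sigma)\Big\{\frac{\|\Delta^*\|_1^2}{s} + \|\Delta^*\|_2^2\Big\} \quad \text{for some constant } c_3.$$

Since $\rho \le \|\theta^*\|_1$, Lemma 5 (as exploited in the proof of Corollary 1) shows that $\|\Delta^*\|_1^2 \le 4s\|\Delta^*\|_2^2$,
and hence that $\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}}) \le c_3\,\chi_n(\Sigma)\|\Delta^*\|_2^2$. This completes the proof of the claim (36) for
$q = 0$.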

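We now turn to the case $q \in (0,1]$, for which we bound the term $\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}})$ using a slightly
different choice of the subspace pair $\mathcal{M}$ and $\overline{\mathcal{M}}^\perp$. For a truncation level $\mu > 0$ to be chosen, define
the set $S_\mu := \{j \in \{1, 2, \ldots, d\} \mid |\theta^*_j| > \mu\}$, and define the associated subspaces $\mathcal{M} = \mathcal{M}(S_\mu)$ and
$\overline{\mathcal{M}}^\perp = \mathcal{M}^\perp(S_\mu)$. By combining Lemma 5 and the definition (23) of $\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}})$, for any pair
$(\mathcal{M}(S_\mu), \mathcal{M}^\perp(S_\mu))$, we have

$$\epsilon^2(\Delta^*; \mathcal{M}, \mathcal{M}^\perp) \le c\,\frac{\zeta(\Sigma)}{\sigma_{\max}(\Sigma)}\frac{\log d}{n}\Big(\|\Pi_{\mathcal{M}^\perp}(\theta^*)\|_1 + \sqrt{|S_\mu|}\,\|\Delta^*\|_2\Big)^2,$$

where $c$ is a universal constant whose value may change from line to line and, to simplify notation,
we have omitted the dependence of $\mathcal{M}$ and $\mathcal{M}^\perp$ on $S_\mu$. We now choose the threshold $\mu$ optimally, so as
to trade off the term $\|\Pi_{\mathcal{M}^\perp}(\theta^*)\|_1$, which increases as $\mu$ increases, with the term $\sqrt{|S_\mu|}\,\|\Delta^*\|_2$,
which decreases as $\mu$ increases. By definition of $\mathcal{M}^\perp(S_\mu)$, we have

$$\|\Pi_{\mathcal{M}^\perp}(\theta^*)\|_1 = \sum_{j \notin S_\mu}|\theta^*_j| = \mu\sum_{j \notin S_\mu}\frac{|\theta^*_j|}{\mu} \le \mu\sum_{j \notin S_\mu}\Big(\frac{|\theta^*_j|}{\mu}\Big)^q,$$

where the inequality holds since $|\theta^*_j| \le \mu$ for all $j \notin S_\mu$. Now since $\theta^* \in \mathbb{B}_q(R_q)$, we conclude that

$$\|\Pi_{\mathcal{M}^\perp}(\theta^*)\|_1 \le \mu^{1-q}\sum_{j \notin S_\mu}|\theta^*_j|^q \le \mu^{1-q}R_q. \qquad (64)$$

On the other hand, again using the inclusion $\theta^* \in \mathbb{B}_q(R_q)$, we have $R_q \ge \sum_{j \in S_\mu}|\theta^*_j|^q \ge |S_\mu|\,\mu^q$,
which implies that $|S_\mu| \le \mu^{-q}R_q$. By combining this bound with inequality (64), we obtain the
upper bound

$$\epsilon^2(\Delta^*; \mathcal{M}, \mathcal{M}^\perp) \le c\,\frac{\zeta(\Sigma)}{\sigma_{\max}(\Sigma)}\frac{\log d}{n}\Big(\mu^{2-2q}R_q^2 + \mu^{-q}R_q\|\Delta^*\|_2^2\Big) = c\,\frac{\zeta(\Sigma)}{\sigma_{\max}(\Sigma)}\frac{\log d}{n}\,\mu^{-q}R_q\Big(\mu^{2-q}R_q + \|\Delta^*\|_2^2\Big).$$

Setting $\mu^2 = \frac{\log d}{n}$ then yields

$$\epsilon^2(\Delta^*; \mathcal{M}, \mathcal{M}^\perp) \le \chi_n(\Sigma)\Big\{R_q\Big(\frac{\log d}{n}\Big)^{1-q/2} + \|\Delta^*\|_2^2\Big\}, \quad \text{where } \chi_n(\Sigma) := c\,\frac{\zeta(\Sigma)}{\sigma_{\max}(\Sigma)}R_q\Big(\frac{\log d}{n}\Big)^{1-q/2}.$$

Finally, let us verify the stated form of the contraction coefficient. For the given subspace
$\overline{\mathcal{M}} = \mathcal{M}(S_\mu)$ and choice of $\mu$, we have $\Psi^2(\overline{\mathcal{M}}) = |S_\mu| \le \mu^{-q}R_q$. From Lemma 6, we have

$$\frac{16\Psi^2(\overline{\mathcal{M}})\big(\tau_\ell(\mathcal{L}_n) + \tau_u(\mathcal{L}_n)\big)}{\gamma_u} \le \chi_n(\Sigma),$$

and hence, by definition (22) of the contraction coefficient,

$$\kappa \le \Big\{1 - \frac{\gamma_\ell}{\gamma_u} + \chi_n(\Sigma)\Big\}\big\{1 - \chi_n(\Sigma)\big\}^{-1}.$$

For proving Corollary 3, we observe that the stated settings $\gamma_\ell$, $\chi_n(\Sigma)$ and $\kappa$ follow directly
from Lemma 6. The bound for condition 2(a) follows from a standard argument about the suprema
of $d$ independent Gaussians with variance $v$.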
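
The truncation argument for weakly sparse vectors rests on two simple bounds for a threshold $\mu > 0$: the number of coordinates larger than $\mu$ is at most $\mu^{-q} R_q$, and the $\ell_1$ mass of the remaining coordinates is at most $\mu^{1-q} R_q$. The sketch below checks both on a polynomially decaying vector; the vector, the exponent `q` and the thresholds are illustrative assumptions.

```python
def truncation_summary(theta, q, mu):
    """Return (|S_mu|, l1 mass outside S_mu, R_q) for S_mu = {j : |theta_j| > mu}."""
    r_q = sum(abs(t) ** q for t in theta)
    support_size = sum(1 for t in theta if abs(t) > mu)
    tail_l1 = sum(abs(t) for t in theta if abs(t) <= mu)
    return support_size, tail_l1, r_q

theta = [1.0 / (j + 1) ** 2 for j in range(50)]   # illustrative, polynomially decaying
q = 0.5
for mu in (0.5, 0.1, 0.01):
    size, tail, r_q = truncation_summary(theta, q, mu)
    # Each printed pair is (actual quantity, corresponding upper bound).
    print(mu, (size, mu ** (-q) * r_q), (tail, mu ** (1 - q) * r_q))
```

Lowering the threshold enlarges the retained support while shrinking the discarded tail, which is exactly the trade-off balanced by the choice of $\mu$ in the proof.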

5.5 Proof of Corollary |4] 

This proof is analogous to that of Corollary 2, but appropriately adapted to the matrix setting. We
first state a lemma that allows us to establish appropriate forms of the RSC/RSM conditions. Recall
that we are studying an instance of matrix regression with random design, where the vectorized
form $\mathrm{vec}(X)$ of each matrix is drawn from a $N(0, \Sigma)$ distribution, where $\Sigma \in \mathbb{R}^{d^2 \times d^2}$ is some
covariance matrix. In order to state this result, let us define the quantity

$$\zeta_{\mathrm{mat}}(\Sigma) := \sup_{\|u\|_2 = 1,\ \|v\|_2 = 1} \mathrm{var}(u^T X v), \quad \text{where } \mathrm{vec}(X) \sim N(0, \Sigma). \qquad (65)$$

Lemma 7. Under the conditions of Corollary 4, there are universal positive constants $(c_0, c_1)$ such
that

and (66a) 
for all A e R'^^'^. (66b) 
with probability at least 1 — exp(— cq n). 

Given the quadratic nature of the least-squares loss, the bound (66a) implies that the RSC condition
holds with $\gamma_\ell = \frac{1}{2}\sigma_{\min}(\Sigma)$ and $\tau_\ell(\mathcal{L}_n) = c_1\zeta_{\mathrm{mat}}(\Sigma)\frac{d}{n}$, whereas the bound (66b) implies that the RSM
condition holds with $\gamma_u = 2\sigma_{\max}(\Sigma)$ and $\tau_u(\mathcal{L}_n) = c_1\zeta_{\mathrm{mat}}(\Sigma)\frac{d}{n}$.

We now prove Corollary 4 in the special case of exactly low rank matrices ($q = 0$), in which $\Theta^*$
has some rank $r \le d$. Given the singular value decomposition $\Theta^* = UDV^T$, let $U^r$ and $V^r$ be the
$d \times r$ matrices whose columns correspond to the $r$ non-zero (left and right, respectively) singular
vectors of $\Theta^*$. As in Section 2.4.2, define the subspace of matrices

$$\mathcal{M}(U^r, V^r) := \{\Theta \in \mathbb{R}^{d \times d} \mid \mathrm{col}(\Theta) \subseteq U^r \text{ and } \mathrm{row}(\Theta) \subseteq V^r\}, \qquad (67)$$



n 1 n 



n n 
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as well as the associated set Ai {U^,V^). Note that 0* € by construction, and moreover (as 
discussed in Section Y2A.2\ the nuclear norm is decomposable with respect to the pair {M.,A4'^). 

By definition (jl2p of the subspace compatibility with nuclear norm as the regularizer and Frobe- 
nius norm as the error norm, we have ^'^(TW) = r. Using the settings of T£(£„) and T„(£n) 
guaranteed by Lemma [7] and substituting into equation (j22p , we obtain a contraction coefficient 



where XnC^) '■= some universal constant C2. A similar calculation shows that the 

tolerance term takes the form 

f III A* IIP "1 

e^{A*;M,M) < C3 Xn(S)|^'^ — + |||A*||||,| for some constant C3. 

Since p < |||0*|||i by assumption. Lemma [5] (as exploited in the proof of Corollary [1]) shows that 
III A* If < 4r|||A*||||,, and hence that 

e\A*;M,M)<C3 Xni^) |||A*||||, 

which show the claim ()42p for q = 0. 

We now turn to the case $q \in (0,1]$; as in the proof of this case for Corollary 2, we bound
$\epsilon^2(\Delta^*; \mathcal{M}, \overline{\mathcal{M}})$ using a slightly different choice of the subspace pair. Recall our notation
$\sigma_1(\Theta^*) \ge \sigma_2(\Theta^*) \ge \cdots \ge \sigma_d(\Theta^*) \ge 0$ for the ordered singular values of $\Theta^*$. For a threshold $\mu$ to be
chosen, define $S_\mu = \{j \in \{1, 2, \ldots, d\} \mid \sigma_j(\Theta^*) > \mu\}$, and let $U(S_\mu) \in \mathbb{R}^{d \times |S_\mu|}$ be the matrix of left
singular vectors indexed by $S_\mu$, with the matrix $V(S_\mu)$ defined similarly. We then define the subspace
$\mathcal{M}(S_\mu) := \mathcal{M}(U(S_\mu), V(S_\mu))$ in an analogous fashion to equation (67), as well as the subspace
$\overline{\mathcal{M}}^\perp(S_\mu)$.

Now by a combination of Lemma [5] and the definition (j23p of e^(A*; A^, A^), for any pair 
{MiS^),M^iS^)), we have 

e\A*;M,M^) < E -,(e*) + v^|||A*^)^ 

where to simplify notation, we have omitted the dependence of A4 and A4'^ on S^. As in the proof 
of Corollary [21 we now choose the threshold /j, optimally, so as to trade-off the term J2ji^s^ ^ji^*) 
with its competitor y^[5^ ||| A* ||| . Exploiting the fact that G* € Mq{Rg) and following the same 
steps as the proof of Corollary [2] yields the bound 

e\A*;M,M^) <^^^^^{p^-^^Rl + ,,-'}Rg\lA*\ll). 



Setting 11^ = ^ then yields 



as claimed. The stated form of the contraction coefficient can be verified by a calculation analogous 
to the proof of Corollary [2l 
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5.6 Proof of Corollary [5] 



In this case, we let $\mathfrak{X}_n : \mathbb{R}^{d \times d} \to \mathbb{R}^n$ be the operator defined by the model of random signed matrix
sampling [30]. As previously argued, establishing the RSM/RSC property amounts to obtaining
a form of uniform control over $\frac{\|\mathfrak{X}_n(\Theta)\|_2^2}{n}$. More specifically, from the proof of Theorem 1, we see
that it suffices to have a form of RSC for the difference $\Delta^t = \Theta^t - \widehat{\Theta}$, and a form of RSM for the
difference $\Theta^{t+1} - \Theta^t$. The following two lemmas summarize these claims:



Lemma 8. There is a constant c such that for all iterations t = 0,1, 2, 
with probability at least 1 — exp(— dlog d), 



and integers r = 1,2, . . . ,d — 1, 



1, 



n 



> - A' 
2' 



ca 



n 



x 



+ a 



rd log d 



n 



+ III A* 



(69) 



Lemma 9. There is a constant c such that for all iterations t = 0, 1, 2, . . . and integers r = 1,2, . . . ,d — 1, 
with probability at least 1 — exp(— dlog d), the difference F* := 0*"*"^ — 0* satisfies the inequality 



l|Xn(r' 



< 2|||r*|||^ + (5„(r), where 



6u{r) := ca 



rdlogd(ZUr+i^:i(®* 



+ a 



n 



rd log d 



n 



+ |||A*|||i. + |||A*|||ir + 



ll|A*+i|||i.}. 



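We can now complete the proof of Corollary 5 by a minor modification of the proof of Theorem 1.
Recalling the elementary relation (53), we have

$$|||\Theta^{t+1} - \widehat{\Theta}|||_F^2 = |||\Theta^t - \widehat{\Theta}|||_F^2 + |||\Theta^t - \Theta^{t+1}|||_F^2 - 2\langle\langle \Theta^t - \widehat{\Theta},\, \Theta^t - \Theta^{t+1} \rangle\rangle.$$

From the proof of Lemma 2, we see that the combination of Lemmas 8 and 9 (with $\gamma_\ell = \frac{1}{2}$ and
$\gamma_u = 2$) implies that

$$2\langle\langle \Theta^t - \Theta^{t+1},\, \Theta^t - \widehat{\Theta} \rangle\rangle \ge |||\Theta^t - \Theta^{t+1}|||_F^2 + \frac{1}{4}|||\Theta^t - \widehat{\Theta}|||_F^2 - \delta_u(r) - \delta_\ell(r),$$

and hence that

$$|||\Delta^{t+1}|||_F^2 \le \frac{3}{4}|||\Delta^t|||_F^2 + \delta_\ell(r) + \delta_u(r).$$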



We substitute the forms of $\delta_\ell(r)$ and $\delta_u(r)$ given in Lemmas [8] and [9] respectively; performing some
algebra then yields



, rd loe d rd loe; d 

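Consequently, as long as $\min\{|||\widehat{\Delta}^t|||_F^2,\, |||\widehat{\Delta}^{t+1}|||_F^2\} \ge c_3\, \alpha\, \frac{r d \log d}{n}$ for a sufficiently large constant $c_3$, we are guaranteed the existence of some $\kappa_t \in (0, 1)$ decreasing with t such that

$$|||\widehat{\Delta}^{t+1}|||_F^2 \le \kappa_t |||\widehat{\Delta}^t|||_F^2 + c' \delta_\ell(r). \qquad (70)$$

Since $\delta_\ell(r) = \Omega\big(\frac{r d \log d}{n}\big)$, this inequality ([70]) is valid for all t = 0, 1, 2, . . . as long as c' is sufficiently large. Now iterating this bound, we see that

$$|||\widehat{\Delta}^{t+1}|||_F^2 \le \Big( \prod_{s=1}^{t} \kappa_s \Big) |||\widehat{\Delta}^0|||_F^2 + c' \delta_\ell(r) \Big\{ 1 + \kappa_t + \kappa_t \kappa_{t-1} + \cdots + \prod_{s=2}^{t} \kappa_s \Big\}.$$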

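Since $\kappa_t$ is decreasing in t, we observe that the second term in the above bound is at most

$$c' \delta_\ell(r) \Big\{ 1 + \kappa_t + \kappa_t \kappa_{t-1} + \cdots + \prod_{s=2}^{t} \kappa_s \Big\} \le c' \delta_\ell(r) \big\{ 1 + \kappa_1 + \kappa_1^2 + \cdots + \kappa_1^{t-1} \big\} \le \frac{c'}{1 - \kappa_1}\, \delta_\ell(r).$$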



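We also define $\bar{\kappa}_t = \big( \sum_{s=1}^{t} \kappa_s \big) / t$. Then the arithmetic mean-geometric mean inequality yields
the upper bound $\prod_{s=1}^{t} \kappa_s \le \bar{\kappa}_t^{\,t}$. Combining this with our earlier upper bound further yields the
inequality

$$|||\widehat{\Delta}^{t+1}|||_F^2 \le \bar{\kappa}_t^{\,t}\, |||\widehat{\Delta}^0|||_F^2 + \frac{c'}{1 - \kappa_1}\, \delta_\ell(r). \qquad (71)$$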
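The two ingredients behind inequality ([71]), namely the geometric-series bound on the accumulated tolerance and the arithmetic mean-geometric mean bound on the product of contraction factors, can be checked on a toy recursion. The sketch below is illustrative only: the contraction factors, the starting error, and the tolerance value are hypothetical numbers, and `kappas[0]` plays the role of the largest factor $\kappa_1$.

```python
def iterate_bound(delta0_sq, kappas, c_delta):
    """Unroll e_{t+1} = kappa_t * e_t + c_delta (with equality) from e_0 = delta0_sq."""
    e = delta0_sq
    for k in kappas:
        e = k * e + c_delta
    return e

# Hypothetical decreasing contraction factors in (0, 1).
kappas = [0.9 - 0.05 * t for t in range(8)]
delta0_sq, c_delta = 10.0, 0.01

worst_case = iterate_bound(delta0_sq, kappas, c_delta)

# Closed-form bound: (average factor)^t * e_0 + c_delta / (1 - largest factor).
t = len(kappas)
kappa_bar = sum(kappas) / t
closed_form = kappa_bar ** t * delta0_sq + c_delta / (1.0 - kappas[0])
```

Here `worst_case` never exceeds `closed_form`, mirroring how the product of the factors is replaced by a power of their average and the tolerance terms are summed as a geometric series.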

It remains to choose the cut-off $r \in \{1, 2, \ldots, d - 1\}$ so as to minimize the term $\delta_\ell(r)$. In particular, when $\Theta^* \in \mathbb{B}_q(R_q)$, then as shown in the paper [29], the optimal choice is $r \asymp \alpha^{-q} R_q \big( \frac{n}{d \log d} \big)^{q/2}$.
Substituting into the inequality ([71]) and performing some algebra yields that there is a universal
constant $c_4$ such that the bound



ll|A*+^lll^ < -*II|A°|||^ + ^{Rl<'d\ogdy^,„ ^ / adlogd 

1 — Ki n V n 

holds. Now by the Cauchy-Schwarz inequality we have 



^^^«dlogd^i-,/2 iii^^iii^ ^ l^^^ad\ogdy.„2 ^ 



n I n 

and the claimed inequality ([45]) follows.

5.7 Proof of Corollary [6] 

Again, the main step in the proof is to establish the RSM and RSC properties for
the decomposition problem. We define $\widehat{\Delta}_\Theta^t = \Theta^t - \widehat{\Theta}$ and $\widehat{\Delta}_\Gamma^t = \Gamma^t - \widehat{\Gamma}$. We begin with
a lemma that establishes RSC for the differences $(\widehat{\Delta}_\Theta^t, \widehat{\Delta}_\Gamma^t)$. As noted in the
previous section, it suffices to show RSC only for these differences. Showing RSC/RSM in this
example amounts to analyzing $|||\widehat{\Delta}_\Theta^t + \widehat{\Delta}_\Gamma^t|||_F^2$. We recall that this section assumes that $\Gamma^*$ has only
s non-zero columns.

Lemma 10. There is a constant c such that for all iterations t = 0,1,2, ... , 

lA'e + AMI > ^(|||A*,|||| + IIAMI) - cayj(|||r - r|||^ + a^J) (72) 
The proof of this lemma follows by a straightforward modification of analogous results in the paper [1].

Matrix decomposition has the interesting property that the RSC condition holds in a determin- 
istic sense (as opposed to with high probability). The same deterministic guarantee holds for the 
RSM condition; indeed, we have 

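$$|||\widehat{\Delta}_\Theta^t + \widehat{\Delta}_\Gamma^t|||_F^2 \le 2 \big( |||\widehat{\Delta}_\Theta^t|||_F^2 + |||\widehat{\Delta}_\Gamma^t|||_F^2 \big), \qquad (73)$$

by the Cauchy-Schwarz inequality. Now we appeal to the more general form of Theorem [1] as stated
in Equation ([49]), which gives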

ll|A*+^|||| + 1114+^11 < (I)* (|||A° + |||A° ill) + c^lj (if - T*y + ^) . 
The stated form of the corollary follows by an application of Cauchy-Schwarz inequality. 
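The deterministic bound ([73]) holds for any pair of matrices, since the gap between its two sides equals the squared Frobenius norm of the difference of the two matrices. A minimal numerical check, with hypothetical matrix sizes and random inputs, might look as follows.

```python
import numpy as np

rng = np.random.default_rng(1)
gaps = []
for _ in range(100):
    a = rng.standard_normal((6, 4))   # stands in for the Theta-difference
    b = rng.standard_normal((6, 4))   # stands in for the Gamma-difference
    lhs = np.linalg.norm(a + b, "fro") ** 2
    rhs = 2.0 * (np.linalg.norm(a, "fro") ** 2 + np.linalg.norm(b, "fro") ** 2)
    gaps.append(rhs - lhs)            # equals ||a - b||_F^2 up to rounding

min_gap = min(gaps)
```

The smallest gap stays non-negative, consistent with the bound holding with no probabilistic qualifier.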

6 Discussion 

In this paper, we have shown that even though high-dimensional M-estimators in statistics are 
neither strongly convex nor smooth, simple first-order methods can still enjoy global guarantees of 
geometric convergence. The key insight is that strong convexity and smoothness need only hold 
in restricted senses, and moreover, these conditions are satisfied with high probability for many 
statistical models and decomposable regularizers used in practice. Examples include sparse linear 
regression and $\ell_1$-regularization, various statistical models with group-sparse regularization, matrix
regression with nuclear norm constraints (including matrix completion and multi-task learning), 
and matrix decomposition problems. Overall, our results highlight some important connections 
between computation and statistics: the properties of M-estimators favorable for fast rates in a 
statistical sense can also be used to establish fast rates for optimization algorithms. 
A Auxiliary results for Theorem [1]

In this appendix, we provide the proofs of various auxiliary lemmas required in the proof of Theorem [1].

A.1 Proof of Lemma [1]

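Since $\theta^t$ and $\widehat{\theta}$ are both feasible and $\widehat{\theta}$ lies on the constraint boundary, we have $\mathcal{R}(\theta^t) \le \mathcal{R}(\widehat{\theta})$. Since
$\mathcal{R}(\widehat{\theta}) \le \mathcal{R}(\theta^*) + \mathcal{R}(\widehat{\theta} - \theta^*)$ by triangle inequality, we conclude that

$$\mathcal{R}(\theta^t) \le \mathcal{R}(\theta^*) + \mathcal{R}(\Delta^*).$$

Since $\theta^* = \Pi_{\mathcal{M}}(\theta^*) + \Pi_{\mathcal{M}^\perp}(\theta^*)$, a second application of triangle inequality yields

$$\mathcal{R}(\theta^t) \le \mathcal{R}(\Pi_{\mathcal{M}}(\theta^*)) + \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + \mathcal{R}(\Delta^*). \qquad (74)$$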

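Now define the difference $\Delta^t := \theta^t - \theta^*$. (Note that this is slightly different from $\widehat{\Delta}^t$, which is
measured relative to the optimum $\widehat{\theta}$.) With this notation, we have

$$\begin{aligned}
\mathcal{R}(\theta^t) &= \mathcal{R}\big( \Pi_{\mathcal{M}}(\theta^*) + \Pi_{\mathcal{M}^\perp}(\theta^*) + \Pi_{\bar{\mathcal{M}}}(\Delta^t) + \Pi_{\bar{\mathcal{M}}^\perp}(\Delta^t) \big) \\
&\overset{(i)}{\ge} \mathcal{R}\big( \Pi_{\mathcal{M}}(\theta^*) + \Pi_{\bar{\mathcal{M}}^\perp}(\Delta^t) \big) - \mathcal{R}\big( \Pi_{\mathcal{M}^\perp}(\theta^*) + \Pi_{\bar{\mathcal{M}}}(\Delta^t) \big) \\
&\overset{(ii)}{\ge} \mathcal{R}\big( \Pi_{\mathcal{M}}(\theta^*) + \Pi_{\bar{\mathcal{M}}^\perp}(\Delta^t) \big) - \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) - \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^t)),
\end{aligned}$$

where steps (i) and (ii) each use the triangle inequality. Now by the decomposability condition, we
have $\mathcal{R}( \Pi_{\mathcal{M}}(\theta^*) + \Pi_{\bar{\mathcal{M}}^\perp}(\Delta^t) ) = \mathcal{R}(\Pi_{\mathcal{M}}(\theta^*)) + \mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta^t))$, so that we have shown that

$$\mathcal{R}(\Pi_{\mathcal{M}}(\theta^*)) + \mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta^t)) - \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) - \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^t)) \le \mathcal{R}(\theta^t).$$

Combining this inequality with the earlier bound ([74]) yields

$$\mathcal{R}(\Pi_{\mathcal{M}}(\theta^*)) + \mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta^t)) - \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) - \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^t)) \le \mathcal{R}(\Pi_{\mathcal{M}}(\theta^*)) + \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + \mathcal{R}(\Delta^*).$$

Re-arranging yields the inequality

$$\mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta^t)) \le \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^t)) + 2 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + \mathcal{R}(\Delta^*). \qquad (75)$$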

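The final step is to translate this inequality into one that applies to the optimization error
$\widehat{\Delta}^t = \theta^t - \widehat{\theta}$. Recalling that $\Delta^* = \widehat{\theta} - \theta^*$, we have $\widehat{\Delta}^t = \Delta^t - \Delta^*$, and hence

$$\mathcal{R}(\widehat{\Delta}^t) \le \mathcal{R}(\Delta^t) + \mathcal{R}(\Delta^*), \quad \text{by triangle inequality.} \qquad (76)$$

In addition, we have

$$\begin{aligned}
\mathcal{R}(\Delta^t) &\le \mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta^t)) + \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^t)) \overset{(i)}{\le} 2 \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^t)) + 2 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + \mathcal{R}(\Delta^*) \\
&\overset{(ii)}{\le} 2 \Psi(\bar{\mathcal{M}}) \|\Pi_{\bar{\mathcal{M}}}(\Delta^t)\| + 2 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + \mathcal{R}(\Delta^*),
\end{aligned}$$

where inequality (i) uses the bound ([75]), and inequality (ii) uses the definition ([12]) of the subspace
compatibility $\Psi$. Combining with the inequality ([76]) yields

$$\mathcal{R}(\widehat{\Delta}^t) \le 2 \Psi(\bar{\mathcal{M}}) \|\Pi_{\bar{\mathcal{M}}}(\Delta^t)\| + 2 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + 2 \mathcal{R}(\Delta^*).$$

Since projection onto a subspace is non-expansive, we have $\|\Pi_{\bar{\mathcal{M}}}(\Delta^t)\| \le \|\Delta^t\|$, and hence

$$\|\Pi_{\bar{\mathcal{M}}}(\Delta^t)\| \le \|\widehat{\Delta}^t + \Delta^*\| \le \|\widehat{\Delta}^t\| + \|\Delta^*\|.$$

Combining the pieces, we obtain the claim ([50]).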
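The three structural facts used in this proof (decomposability, subspace compatibility, and non-expansiveness of projections) are easy to verify in the simplest special case. The sketch below assumes the $\ell_1$-norm as regularizer, the Euclidean norm as error norm, and coordinate subspaces indexed by a support set S, for which the compatibility constant is $\sqrt{s}$; the dimensions and the helper `project` are hypothetical.

```python
import numpy as np

def project(v, support):
    """Project v onto the coordinate subspace of vectors supported on `support`."""
    out = np.zeros_like(v)
    out[support] = v[support]
    return out

rng = np.random.default_rng(2)
d, s = 50, 5                        # hypothetical ambient dimension and sparsity
S = np.arange(s)                    # model subspace: support S
Sc = np.arange(s, d)                # perturbation subspace: complement of S
theta = rng.standard_normal(d)
delta = rng.standard_normal(d)

# Decomposability: l1 norm is additive across a vector on S and a vector on S^c.
alpha = project(theta, S)
beta = project(delta, Sc)
decomposable_gap = abs(np.abs(alpha + beta).sum()
                       - (np.abs(alpha).sum() + np.abs(beta).sum()))

# Subspace compatibility: ||v||_1 <= sqrt(s) * ||v||_2 for v supported on S.
proj = project(delta, S)
l1 = np.abs(proj).sum()
compat_bound = np.sqrt(s) * np.linalg.norm(proj)

# Non-expansiveness of projection onto a subspace.
nonexpansive = np.linalg.norm(proj) <= np.linalg.norm(delta)
```

In this special case the bound ([75]) reduces to the familiar cone-type constraint on the off-support part of the error vector.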
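A.2 Proof of Lemma [2]

We start by applying the RSC assumption to the pair $\widehat{\theta}$ and $\theta^t$, thereby obtaining the lower bound

$$\begin{aligned}
\mathcal{L}_n(\widehat{\theta}) - \frac{\gamma_\ell}{2} \|\widehat{\theta} - \theta^t\|^2 &\ge \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \widehat{\theta} - \theta^t \rangle - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\theta^t - \widehat{\theta}) \\
&= \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \theta^{t+1} - \theta^t \rangle + \langle \nabla \mathcal{L}_n(\theta^t),\, \widehat{\theta} - \theta^{t+1} \rangle - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\theta^t - \widehat{\theta}).
\end{aligned} \qquad (77)$$

Here the second line follows by adding and subtracting terms.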

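Now for compactness in notation, define $\varphi_t(\theta) := \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \theta - \theta^t \rangle + \frac{\gamma_u}{2} \|\theta - \theta^t\|^2$, and
note that by definition of the algorithm, the iterate $\theta^{t+1}$ minimizes $\varphi_t(\theta)$ over the ball $\mathbb{B}_{\mathcal{R}}(\rho)$. Moreover, since $\widehat{\theta}$ is feasible, the first-order conditions for optimality imply that $\langle \nabla \varphi_t(\theta^{t+1}),\, \widehat{\theta} - \theta^{t+1} \rangle \ge 0$,
or equivalently that $\langle \nabla \mathcal{L}_n(\theta^t) + \gamma_u (\theta^{t+1} - \theta^t),\, \widehat{\theta} - \theta^{t+1} \rangle \ge 0$. Applying this inequality to the lower
bound ([77]), we find that

$$\begin{aligned}
\mathcal{L}_n(\widehat{\theta}) - \frac{\gamma_\ell}{2} \|\widehat{\theta} - \theta^t\|^2 &\ge \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \theta^{t+1} - \theta^t \rangle + \gamma_u \langle \theta^t - \theta^{t+1},\, \widehat{\theta} - \theta^{t+1} \rangle - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\theta^t - \widehat{\theta}) \\
&= \varphi_t(\theta^{t+1}) - \frac{\gamma_u}{2} \|\theta^{t+1} - \theta^t\|^2 + \gamma_u \langle \theta^t - \theta^{t+1},\, \widehat{\theta} - \theta^{t+1} \rangle - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\theta^t - \widehat{\theta}) \\
&= \varphi_t(\theta^{t+1}) + \frac{\gamma_u}{2} \|\theta^{t+1} - \theta^t\|^2 + \gamma_u \langle \theta^t - \theta^{t+1},\, \widehat{\theta} - \theta^t \rangle - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\theta^t - \widehat{\theta}),
\end{aligned} \qquad (78)$$

where the last step follows from adding and subtracting $\theta^{t+1}$ in the inner product.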
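Now by the RSM condition, we have

$$\varphi_t(\theta^{t+1}) \ge \mathcal{L}_n(\theta^{t+1}) - \tau_u(\mathcal{L}_n)\, \mathcal{R}^2(\theta^{t+1} - \theta^t) \overset{(a)}{\ge} \mathcal{L}_n(\widehat{\theta}) - \tau_u(\mathcal{L}_n)\, \mathcal{R}^2(\theta^{t+1} - \theta^t), \qquad (79)$$

where inequality (a) follows by the optimality of $\widehat{\theta}$, and feasibility of $\theta^{t+1}$. Combining this inequality
with the previous bound ([78]) yields that $\mathcal{L}_n(\widehat{\theta}) - \frac{\gamma_\ell}{2} \|\widehat{\theta} - \theta^t\|^2$ is lower bounded by

$$\mathcal{L}_n(\widehat{\theta}) + \frac{\gamma_u}{2} \|\theta^{t+1} - \theta^t\|^2 + \gamma_u \langle \theta^t - \theta^{t+1},\, \widehat{\theta} - \theta^t \rangle - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\theta^t - \widehat{\theta}) - \tau_u(\mathcal{L}_n)\, \mathcal{R}^2(\theta^{t+1} - \theta^t),$$

and the claim ([52]) follows after some simple algebraic manipulations.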

B Auxiliary results for Theorem [2]

In this appendix, we prove the two auxiliary lemmas required in the proof of Theorem [2].

B.1 Proof of Lemma [3]

This result is a generalization of an analogous result in Negahban et al. [28], with some changes
required so as to adapt the statement to the optimization setting. Let $\theta$ be any vector, feasible for
the problem ([2]), that satisfies the bound

$$\phi(\theta) \le \phi(\theta^*) + \bar{\eta}, \qquad (80)$$

and assume that $\lambda_n \ge 2 \mathcal{R}^*(\nabla \mathcal{L}_n(\theta^*))$. We then claim that the error vector $\Delta := \theta - \theta^*$ satisfies
the inequality

$$\mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta)) \le 3 \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta)) + 4 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + 2 \min\Big\{ \frac{\bar{\eta}}{\lambda_n},\, \bar{\rho} \Big\}. \qquad (81)$$

For the moment, we take this claim as given, returning later to verify its validity.

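By applying this intermediate claim ([81]) in two different ways, we can complete the proof of
Lemma [3]. First, we observe that when $\theta = \widehat{\theta}$, the optimality of $\widehat{\theta}$ and feasibility of $\theta^*$ imply that
assumption ([80]) holds with $\bar{\eta} = 0$, and hence the intermediate claim ([81]) implies that the statistical
error $\Delta^* = \theta^* - \widehat{\theta}$ satisfies the bound

$$\mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta^*)) \le 3 \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^*)) + 4 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)). \qquad (82)$$

Since $\Delta^* = \Pi_{\bar{\mathcal{M}}}(\Delta^*) + \Pi_{\bar{\mathcal{M}}^\perp}(\Delta^*)$, we can write

$$\mathcal{R}(\Delta^*) = \mathcal{R}\big( \Pi_{\bar{\mathcal{M}}}(\Delta^*) + \Pi_{\bar{\mathcal{M}}^\perp}(\Delta^*) \big) \le 4 \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^*)) + 4 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)), \qquad (83)$$

using the triangle inequality in conjunction with our earlier bound ([82]). Similarly, when $\theta = \theta^t$ for
some $t \ge T$, then the given assumptions imply that condition ([80]) holds with $\bar{\eta} > 0$, so that the
intermediate claim (followed by the same argument with triangle inequality) implies that the error
$\Delta^t = \theta^t - \theta^*$ satisfies the bound

$$\mathcal{R}(\Delta^t) \le 4 \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^t)) + 4 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + 2 \min\Big\{ \frac{\bar{\eta}}{\lambda_n},\, \bar{\rho} \Big\}. \qquad (84)$$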

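Now let $\widehat{\Delta}^t = \theta^t - \widehat{\theta}$ be the optimization error at time t, and observe that we have the decomposition $\widehat{\Delta}^t = \Delta^t + \Delta^*$. Consequently, by triangle inequality

$$\begin{aligned}
\mathcal{R}(\widehat{\Delta}^t) &\le \mathcal{R}(\Delta^t) + \mathcal{R}(\Delta^*) \\
&\overset{(i)}{\le} 4 \big\{ \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^t)) + \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^*)) \big\} + 8 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + 2 \min\Big\{ \frac{\bar{\eta}}{\lambda_n},\, \bar{\rho} \Big\} \\
&\overset{(ii)}{\le} 4 \Psi(\bar{\mathcal{M}}) \big\{ \|\Pi_{\bar{\mathcal{M}}}(\Delta^t)\| + \|\Pi_{\bar{\mathcal{M}}}(\Delta^*)\| \big\} + 8 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + 2 \min\Big\{ \frac{\bar{\eta}}{\lambda_n},\, \bar{\rho} \Big\} \\
&\overset{(iii)}{\le} 4 \Psi(\bar{\mathcal{M}}) \big\{ \|\Delta^t\| + \|\Delta^*\| \big\} + 8 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) + 2 \min\Big\{ \frac{\bar{\eta}}{\lambda_n},\, \bar{\rho} \Big\},
\end{aligned} \qquad (85)$$

where step (i) follows by applying both equation ([83]) and ([84]); step (ii) follows from the definition ([12]) of the subspace compatibility that relates the regularizer to the norm $\| \cdot \|$; and step (iii)
follows from the fact that projection onto a subspace is non-expansive. Finally, since $\Delta^t = \widehat{\Delta}^t - \Delta^*$,
the triangle inequality implies that $\|\Delta^t\| \le \|\widehat{\Delta}^t\| + \|\Delta^*\|$. Substituting this upper bound into inequality ([85]) completes the proof of Lemma [3].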

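It remains to prove the intermediate claim ([81]). Let $\theta$ be any vector, feasible for the
program ([2]), and satisfying the condition ([80]), and let $\Delta = \theta - \theta^*$ be the associated error vector.
Re-writing the condition ([80]), we have

$$\mathcal{L}_n(\theta^* + \Delta) + \lambda_n \mathcal{R}(\theta^* + \Delta) \le \mathcal{L}_n(\theta^*) + \lambda_n \mathcal{R}(\theta^*) + \bar{\eta}.$$

Subtracting $\langle \nabla \mathcal{L}_n(\theta^*),\, \Delta \rangle$ from each side and then re-arranging yields the inequality

$$\mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle \nabla \mathcal{L}_n(\theta^*),\, \Delta \rangle + \lambda_n \big\{ \mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) \big\} \le - \langle \nabla \mathcal{L}_n(\theta^*),\, \Delta \rangle + \bar{\eta}.$$

The convexity of $\mathcal{L}_n$ then implies that $\mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle \nabla \mathcal{L}_n(\theta^*),\, \Delta \rangle \ge 0$, and hence that

$$\lambda_n \big\{ \mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) \big\} \le - \langle \nabla \mathcal{L}_n(\theta^*),\, \Delta \rangle + \bar{\eta}.$$

Applying Hölder's inequality to $\langle \nabla \mathcal{L}_n(\theta^*),\, \Delta \rangle$, as expressed in terms of the dual norms $\mathcal{R}$ and
$\mathcal{R}^*$, yields the upper bound

$$\lambda_n \big\{ \mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) \big\} \le \mathcal{R}^*(\nabla \mathcal{L}_n(\theta^*))\, \mathcal{R}(\Delta) + \bar{\eta} \overset{(i)}{\le} \frac{\lambda_n}{2}\, \mathcal{R}(\Delta) + \bar{\eta},$$

where step (i) uses the fact that $\lambda_n \ge 2 \mathcal{R}^*(\nabla \mathcal{L}_n(\theta^*))$ by assumption.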
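The Hölder step can be made concrete for the $\ell_1$-regularizer, whose dual norm is the $\ell_\infty$-norm. In the sketch below, `grad` is a random stand-in for the gradient of the loss at the true parameter, `delta` is a random stand-in for the error vector, and `lam` is set to the smallest regularization weight permitted by the assumption; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
grad = rng.standard_normal(30)     # stand-in for the gradient at theta*
delta = rng.standard_normal(30)    # stand-in for the error vector

inner = abs(float(grad @ delta))                               # |<grad, delta>|
holder = float(np.max(np.abs(grad)) * np.sum(np.abs(delta)))   # R*(grad) * R(delta)

# Smallest regularization weight allowed by lam >= 2 * R*(grad).
lam = 2.0 * float(np.max(np.abs(grad)))
half_lam_bound = 0.5 * lam * float(np.sum(np.abs(delta)))      # (lam / 2) * R(delta)
```

With this choice of `lam`, the Hölder bound and the $(\lambda_n / 2)\,\mathcal{R}(\Delta)$ bound coincide; any larger `lam` only loosens step (i).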

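For the remainder of the proof, let us introduce the convenient shorthand $\Delta_{\bar{\mathcal{M}}} := \Pi_{\bar{\mathcal{M}}}(\Delta)$
and $\Delta_{\bar{\mathcal{M}}^\perp} := \Pi_{\bar{\mathcal{M}}^\perp}(\Delta)$, with similar shorthand for projections involving $\theta^*$. Making note of the
decomposition $\Delta = \Delta_{\bar{\mathcal{M}}} + \Delta_{\bar{\mathcal{M}}^\perp}$, an application of triangle inequality then yields the upper bound

$$\mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) \le \frac{1}{2} \big\{ \mathcal{R}(\Delta_{\bar{\mathcal{M}}}) + \mathcal{R}(\Delta_{\bar{\mathcal{M}}^\perp}) \big\} + \frac{\bar{\eta}}{\lambda_n}, \qquad (86)$$

where we have rescaled both sides by $\lambda_n > 0$.

It remains to further lower bound the left-hand side of ([86]). By triangle inequality, we have

$$- \mathcal{R}(\theta^*) \ge - \mathcal{R}(\theta^*_{\mathcal{M}}) - \mathcal{R}(\theta^*_{\mathcal{M}^\perp}). \qquad (87)$$

Let us now write $\theta^* + \Delta = \theta^*_{\mathcal{M}} + \theta^*_{\mathcal{M}^\perp} + \Delta_{\bar{\mathcal{M}}} + \Delta_{\bar{\mathcal{M}}^\perp}$. Using this representation and triangle
inequality, we have

$$\mathcal{R}(\theta^* + \Delta) \ge \mathcal{R}(\theta^*_{\mathcal{M}} + \Delta_{\bar{\mathcal{M}}^\perp}) - \mathcal{R}(\theta^*_{\mathcal{M}^\perp} + \Delta_{\bar{\mathcal{M}}}) \ge \mathcal{R}(\theta^*_{\mathcal{M}} + \Delta_{\bar{\mathcal{M}}^\perp}) - \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) - \mathcal{R}(\Delta_{\bar{\mathcal{M}}}).$$

Finally, since $\theta^*_{\mathcal{M}} \in \mathcal{M}$ and $\Delta_{\bar{\mathcal{M}}^\perp} \in \bar{\mathcal{M}}^\perp$, the decomposability of $\mathcal{R}$ implies that $\mathcal{R}(\theta^*_{\mathcal{M}} + \Delta_{\bar{\mathcal{M}}^\perp}) = \mathcal{R}(\theta^*_{\mathcal{M}}) + \mathcal{R}(\Delta_{\bar{\mathcal{M}}^\perp})$, and hence that

$$\mathcal{R}(\theta^* + \Delta) \ge \mathcal{R}(\theta^*_{\mathcal{M}}) + \mathcal{R}(\Delta_{\bar{\mathcal{M}}^\perp}) - \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) - \mathcal{R}(\Delta_{\bar{\mathcal{M}}}). \qquad (88)$$

Adding together equations ([87]) and ([88]), we obtain the lower bound

$$\mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) \ge \mathcal{R}(\Delta_{\bar{\mathcal{M}}^\perp}) - 2 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) - \mathcal{R}(\Delta_{\bar{\mathcal{M}}}). \qquad (89)$$

Combining this lower bound with the earlier inequality ([86]), some algebra yields the bound

$$\mathcal{R}(\Delta_{\bar{\mathcal{M}}^\perp}) \le 3 \mathcal{R}(\Delta_{\bar{\mathcal{M}}}) + 4 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) + \frac{2 \bar{\eta}}{\lambda_n},$$

corresponding to the bound ([81]) when $\bar{\eta} / \lambda_n$ achieves the final minimum. To obtain the final term
involving $\bar{\rho}$ in the bound ([81]), two applications of triangle inequality yield

$$\mathcal{R}(\Delta_{\bar{\mathcal{M}}^\perp}) \le \mathcal{R}(\Delta_{\bar{\mathcal{M}}}) + \mathcal{R}(\Delta) \le \mathcal{R}(\Delta_{\bar{\mathcal{M}}}) + 2 \bar{\rho},$$

where we have used the fact that $\mathcal{R}(\Delta) \le \mathcal{R}(\theta) + \mathcal{R}(\theta^*) \le 2 \bar{\rho}$, since both $\theta$ and $\theta^*$ are feasible for
the program ([2]).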
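B.2 Proof of Lemma [4]

The proof of this result follows lines similar to the proof of convergence by Nesterov [32]. Recall
our notation $\phi(\theta) = \mathcal{L}_n(\theta) + \lambda_n \mathcal{R}(\theta)$, $\widehat{\Delta}^t = \theta^t - \widehat{\theta}$, and that $\eta_\phi^t = \phi(\theta^t) - \phi(\widehat{\theta})$. We begin by proving
that under the stated conditions, a useful version of restricted strong convexity ([48]) is in force:

Lemma 11. Under the assumptions of Lemma [4], we are guaranteed that

$$\Big\{ \frac{\gamma_\ell}{2} - 32 \tau_\ell(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}}) \Big\} \|\widehat{\Delta}^t\|^2 \le 2 \tau_\ell(\mathcal{L}_n) v^2 + \phi(\theta^t) - \phi(\widehat{\theta}), \quad \text{and} \qquad (90a)$$

$$\Big\{ \frac{\gamma_\ell}{2} - 32 \tau_\ell(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}}) \Big\} \|\widehat{\Delta}^t\|^2 \le 2 \tau_\ell(\mathcal{L}_n) v^2 + \mathcal{T}_{\mathcal{L}}(\widehat{\theta}; \theta^t), \qquad (90b)$$

where $v := \bar{\epsilon}_{\mathrm{stat}} + 2 \min\big( \frac{\bar{\eta}}{\lambda_n},\, \bar{\rho} \big)$.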

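See Appendix [B.3] for the proof of this claim. So as to ease notation in the remainder of the proof,
let us introduce the shorthand

$$\phi_t(\theta) := \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \theta - \theta^t \rangle + \frac{\gamma_u}{2} \|\theta - \theta^t\|^2 + \lambda_n \mathcal{R}(\theta), \qquad (91)$$

corresponding to the approximation to the regularized loss function $\phi$ that is minimized at iteration t of the update ([!]). Since $\theta^{t+1}$ minimizes $\phi_t$ over the set $\mathbb{B}_{\mathcal{R}}(\bar{\rho})$, we are guaranteed that
$\phi_t(\theta^{t+1}) \le \phi_t(\theta)$ for all $\theta \in \mathbb{B}_{\mathcal{R}}(\bar{\rho})$. In particular, for any $\alpha \in (0, 1)$, the vector $\theta_\alpha = \alpha \widehat{\theta} + (1 - \alpha) \theta^t$
lies in the convex set $\mathbb{B}_{\mathcal{R}}(\bar{\rho})$, so that

$$\begin{aligned}
\phi_t(\theta^{t+1}) \le \phi_t(\theta_\alpha) &= \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \theta_\alpha - \theta^t \rangle + \frac{\gamma_u}{2} \|\theta_\alpha - \theta^t\|^2 + \lambda_n \mathcal{R}(\theta_\alpha) \\
&\overset{(i)}{=} \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \alpha \widehat{\theta} - \alpha \theta^t \rangle + \frac{\gamma_u \alpha^2}{2} \|\widehat{\theta} - \theta^t\|^2 + \lambda_n \mathcal{R}(\theta_\alpha) \\
&\overset{(ii)}{\le} \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \alpha \widehat{\theta} - \alpha \theta^t \rangle + \frac{\gamma_u \alpha^2}{2} \|\widehat{\theta} - \theta^t\|^2 + \lambda_n \alpha \mathcal{R}(\widehat{\theta}) + \lambda_n (1 - \alpha) \mathcal{R}(\theta^t),
\end{aligned}$$

where step (i) follows from substituting the definition of $\theta_\alpha$, and step (ii) uses the convexity of the
regularizer $\mathcal{R}$.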
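For intuition about the surrogate ([91]), consider the special case of a least-squares loss with $\ell_1$-regularization. If the side constraint is assumed to be inactive (a simplification made here only for illustration, so this is not the full constrained update), the minimizer of the surrogate has a closed form given by soft-thresholding. The sketch below uses synthetic data, and the helper names `soft_threshold`, `composite_step`, and `surrogate`, as well as the values of `lam` and `gamma_u`, are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Coordinate-wise soft-thresholding, the proximal map of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_step(theta, X, y, lam, gamma_u):
    """One composite gradient step for least squares + l1 (side constraint ignored)."""
    n = X.shape[0]
    grad = X.T @ (X @ theta - y) / n
    return soft_threshold(theta - grad / gamma_u, lam / gamma_u)

def surrogate(theta_new, theta, X, y, lam, gamma_u):
    """Value of the quadratic-plus-l1 surrogate built around the current iterate theta."""
    n = X.shape[0]
    resid = X @ theta - y
    grad = X.T @ resid / n
    diff = theta_new - theta
    return (resid @ resid / (2 * n) + grad @ diff
            + 0.5 * gamma_u * diff @ diff + lam * np.abs(theta_new).sum())

rng = np.random.default_rng(4)
n, d = 40, 100                                  # more dimensions than samples
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:3] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)

lam, gamma_u = 0.1, 2.0                         # hypothetical parameter choices
theta_t = np.zeros(d)
theta_next = composite_step(theta_t, X, y, lam, gamma_u)

# The step should minimize the surrogate: random perturbations cannot do better.
best = surrogate(theta_next, theta_t, X, y, lam, gamma_u)
perturbed = min(
    surrogate(theta_next + 0.01 * rng.standard_normal(d), theta_t, X, y, lam, gamma_u)
    for _ in range(50)
)
```

The comparison between `best` and `perturbed` reflects the minimizing property of $\theta^{t+1}$ that the argument above exploits when it evaluates the surrogate at the convex combination $\theta_\alpha$.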

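Now, the stated conditions of the lemma ensure that $\gamma_\ell / 2 - 32 \tau_\ell(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}}) \ge 0$, so that by
equation ([90b]), we have $\mathcal{L}_n(\widehat{\theta}) + 2 \tau_\ell(\mathcal{L}_n) v^2 \ge \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \widehat{\theta} - \theta^t \rangle$. Substituting back into
our earlier bound yields

$$\begin{aligned}
\phi_t(\theta^{t+1}) &\le (1 - \alpha) \mathcal{L}_n(\theta^t) + \alpha \mathcal{L}_n(\widehat{\theta}) + 2 \alpha \tau_\ell(\mathcal{L}_n) v^2 + \frac{\gamma_u \alpha^2}{2} \|\widehat{\theta} - \theta^t\|^2 + \alpha \lambda_n \mathcal{R}(\widehat{\theta}) + (1 - \alpha) \lambda_n \mathcal{R}(\theta^t) \\
&\overset{(iii)}{\le} \phi(\theta^t) - \alpha \big( \phi(\theta^t) - \phi(\widehat{\theta}) \big) + 2 \tau_\ell(\mathcal{L}_n) v^2 + \frac{\gamma_u \alpha^2}{2} \|\widehat{\theta} - \theta^t\|^2,
\end{aligned} \qquad (92)$$

where we have used the definition of $\phi$ and $\alpha \le 1$ in step (iii).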

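In order to complete the proof, it remains to relate $\phi_t(\theta^{t+1})$ to $\phi(\theta^{t+1})$, which can be performed
by exploiting restricted smoothness. In particular, applying the RSM condition at the iterate $\theta^{t+1}$
in the direction $\theta^t$ yields the upper bound

$$\mathcal{L}_n(\theta^{t+1}) \le \mathcal{L}_n(\theta^t) + \langle \nabla \mathcal{L}_n(\theta^t),\, \theta^{t+1} - \theta^t \rangle + \frac{\gamma_u}{2} \|\theta^{t+1} - \theta^t\|^2 + \tau_u(\mathcal{L}_n)\, \mathcal{R}^2(\theta^{t+1} - \theta^t),$$

so that

$$\phi(\theta^{t+1}) \le \phi_t(\theta^{t+1}) + \tau_u(\mathcal{L}_n)\, \mathcal{R}^2(\theta^{t+1} - \theta^t).$$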

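Combining the above bound with the inequality ([92]) and recalling the notation $\widehat{\Delta}^t = \theta^t - \widehat{\theta}$, we
obtain

$$\begin{aligned}
\phi(\theta^{t+1}) &\le \phi(\theta^t) - \alpha \big( \phi(\theta^t) - \phi(\widehat{\theta}) \big) + \frac{\gamma_u \alpha^2}{2} \|\widehat{\theta} - \theta^t\|^2 + \tau_u(\mathcal{L}_n)\, \mathcal{R}^2(\theta^{t+1} - \theta^t) + 2 \tau_\ell(\mathcal{L}_n) v^2 \\
&\overset{(iv)}{\le} \phi(\theta^t) - \alpha \big( \phi(\theta^t) - \phi(\widehat{\theta}) \big) + \frac{\gamma_u \alpha^2}{2} \|\widehat{\Delta}^t\|^2 + \tau_u(\mathcal{L}_n) \big[ \mathcal{R}(\widehat{\Delta}^{t+1}) + \mathcal{R}(\widehat{\Delta}^t) \big]^2 + 2 \tau_\ell(\mathcal{L}_n) v^2 \\
&\overset{(v)}{\le} \phi(\theta^t) - \alpha \big( \phi(\theta^t) - \phi(\widehat{\theta}) \big) + \frac{\gamma_u \alpha^2}{2} \|\widehat{\Delta}^t\|^2 + 2 \tau_u(\mathcal{L}_n) \big( \mathcal{R}^2(\widehat{\Delta}^{t+1}) + \mathcal{R}^2(\widehat{\Delta}^t) \big) + 2 \tau_\ell(\mathcal{L}_n) v^2.
\end{aligned} \qquad (93)$$

Here step (iv) uses the fact that $\theta^t - \theta^{t+1} = \widehat{\Delta}^t - \widehat{\Delta}^{t+1}$ and applies triangle inequality to the norm
$\mathcal{R}$, whereas step (v) follows from the Cauchy-Schwarz inequality.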

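Next, combining Lemma [3] with the Cauchy-Schwarz inequality yields the upper
bound

$$\mathcal{R}^2(\widehat{\Delta}^t) \le 32 \Psi^2(\bar{\mathcal{M}}) \|\widehat{\Delta}^t\|^2 + 2 v^2, \qquad (94)$$

where $v = \bar{\epsilon}_{\mathrm{stat}}(\mathcal{M}, \bar{\mathcal{M}}) + 2 \min\big( \frac{\bar{\eta}}{\lambda_n},\, \bar{\rho} \big)$ is a constant independent of $\theta^t$, and $\bar{\epsilon}_{\mathrm{stat}}(\mathcal{M}, \bar{\mathcal{M}})$ was previously defined in the lemma statement. Substituting the above bound into inequality ([93]) yields
that $\phi(\theta^{t+1})$ is at most

$$\phi(\theta^t) - \alpha \big( \phi(\theta^t) - \phi(\widehat{\theta}) \big) + \frac{\gamma_u \alpha^2}{2} \|\widehat{\Delta}^t\|^2 + 64 \tau_u(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}}) \|\widehat{\Delta}^{t+1}\|^2 + 64 \tau_u(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}}) \|\widehat{\Delta}^t\|^2 + 8 \tau_u(\mathcal{L}_n) v^2 + 2 \tau_\ell(\mathcal{L}_n) v^2. \qquad (95)$$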

The final step is to translate quantities involving $\widehat{\Delta}^t$ to functional values, which may be done
using the RSC condition ([90a]) from Lemma [11]. In particular, combining the RSC condition ([90a])
with the inequality ([95]) yields

,(.-^) < m - + (W±64^^0^(^. ^ ^ 
Uru{Cn)^\M) ^ 2r,(£„)t;2) + 8r„(£„)^;2 + 2r,(/:„)^;^ 

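where we have introduced the shorthand $\bar{\gamma}_\ell := \gamma_\ell - 64 \tau_\ell(\mathcal{L}_n) \Psi^2(\bar{\mathcal{M}})$. Recalling the definition of $\beta$,
adding and subtracting $\phi(\widehat{\theta})$ from both sides, and choosing $\alpha = \frac{\bar{\gamma}_\ell}{2 \gamma_u} \in (0, 1)$, we obtain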

7£ / V 47^, 7£ y 

Recalling the definition of the contraction factor $\kappa$ from the statement of Theorem [2], the above
expression can be rewritten as



< -4 + l^imiMy, where CiM) = {l - ^Jl.i£fliMly^ 
Finally, iterating the above expression, and using the condition $\kappa \in (0, 1)$ in order to sum the geometric series, yields the claimed bound on $\eta_\phi^t$, thereby completing the proof.

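B.3 Proof of Lemma [11]

The key idea in proving the lemma is to use the definition of RSC along with the iterated cone bound
of Lemma [3] to simplify the error terms in RSC.

Let us first show that condition ([90a]) holds. From the RSC condition assumed in the lemma
statement, we have

$$\mathcal{L}_n(\theta^t) - \mathcal{L}_n(\widehat{\theta}) - \langle \nabla \mathcal{L}_n(\widehat{\theta}),\, \theta^t - \widehat{\theta} \rangle \ge \frac{\gamma_\ell}{2} \|\widehat{\theta} - \theta^t\|^2 - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\widehat{\theta} - \theta^t). \qquad (96)$$

From the convexity of $\mathcal{R}$ and definition of the subdifferential $\partial \mathcal{R}(\theta)$, we obtain

$$\mathcal{R}(\theta^t) - \mathcal{R}(\widehat{\theta}) - \langle \partial \mathcal{R}(\widehat{\theta}),\, \theta^t - \widehat{\theta} \rangle \ge 0.$$

Multiplying this lower bound by $\lambda_n$ and adding it to the inequality ([96]) yields

$$\phi(\theta^t) - \phi(\widehat{\theta}) - \langle \nabla \phi(\widehat{\theta}),\, \theta^t - \widehat{\theta} \rangle \ge \frac{\gamma_\ell}{2} \|\widehat{\theta} - \theta^t\|^2 - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\widehat{\theta} - \theta^t),$$

where we recall that $\phi(\theta) = \mathcal{L}_n(\theta) + \lambda_n \mathcal{R}(\theta)$ is our objective function. By the optimality of $\widehat{\theta}$ and
feasibility of $\theta^t$, we are guaranteed that $\langle \nabla \phi(\widehat{\theta}),\, \theta^t - \widehat{\theta} \rangle \ge 0$, and hence

$$\begin{aligned}
\phi(\theta^t) - \phi(\widehat{\theta}) &\ge \frac{\gamma_\ell}{2} \|\widehat{\theta} - \theta^t\|^2 - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\widehat{\theta} - \theta^t) \\
&\overset{(i)}{\ge} \frac{\gamma_\ell}{2} \|\widehat{\theta} - \theta^t\|^2 - \tau_\ell(\mathcal{L}_n) \big\{ 32 \Psi^2(\bar{\mathcal{M}}) \|\widehat{\theta} - \theta^t\|^2 + 2 v^2 \big\},
\end{aligned}$$

where step (i) follows by applying Lemma [3]. Some algebra then yields the claim ([90a]).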
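Finally, let us verify the claim ([90b]). Using the RSC condition, we have

$$\mathcal{L}_n(\widehat{\theta}) - \mathcal{L}_n(\theta^t) - \langle \nabla \mathcal{L}_n(\theta^t),\, \widehat{\theta} - \theta^t \rangle \ge \frac{\gamma_\ell}{2} \|\widehat{\theta} - \theta^t\|^2 - \tau_\ell(\mathcal{L}_n)\, \mathcal{R}^2(\widehat{\theta} - \theta^t). \qquad (97)$$

As before, applying Lemma [3] yields

$$\mathcal{L}_n(\widehat{\theta}) - \mathcal{L}_n(\theta^t) - \langle \nabla \mathcal{L}_n(\theta^t),\, \widehat{\theta} - \theta^t \rangle \ge \frac{\gamma_\ell}{2} \|\widehat{\theta} - \theta^t\|^2 - \tau_\ell(\mathcal{L}_n) \big( 32 \Psi^2(\bar{\mathcal{M}}) \|\widehat{\theta} - \theta^t\|^2 + 2 v^2 \big),$$

and rearranging the terms establishes the claim ([90b]).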

C Proof of Lemma [5] 

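Given the condition $\mathcal{R}(\widehat{\theta}) \le \rho \le \mathcal{R}(\theta^*)$, we have $\mathcal{R}(\widehat{\theta}) = \mathcal{R}(\theta^* + \Delta^*) \le \mathcal{R}(\theta^*)$. By triangle
inequality, we have

$$\mathcal{R}(\theta^*) = \mathcal{R}\big( \Pi_{\mathcal{M}}(\theta^*) + \Pi_{\mathcal{M}^\perp}(\theta^*) \big) \le \mathcal{R}(\Pi_{\mathcal{M}}(\theta^*)) + \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)).$$

We then write

$$\begin{aligned}
\mathcal{R}(\theta^* + \Delta^*) &= \mathcal{R}\big( \Pi_{\mathcal{M}}(\theta^*) + \Pi_{\mathcal{M}^\perp}(\theta^*) + \Pi_{\bar{\mathcal{M}}}(\Delta^*) + \Pi_{\bar{\mathcal{M}}^\perp}(\Delta^*) \big) \\
&\overset{(i)}{\ge} \mathcal{R}\big( \Pi_{\mathcal{M}}(\theta^*) + \Pi_{\bar{\mathcal{M}}^\perp}(\Delta^*) \big) - \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^*)) - \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) \\
&\overset{(ii)}{=} \mathcal{R}(\Pi_{\mathcal{M}}(\theta^*)) + \mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta^*)) - \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^*)) - \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)),
\end{aligned}$$

where the bound (i) follows by triangle inequality, and step (ii) uses the decomposability of $\mathcal{R}$ over
the pair $\mathcal{M}$ and $\bar{\mathcal{M}}^\perp$. By combining this lower bound with the previously established upper bound

$$\mathcal{R}(\theta^* + \Delta^*) \le \mathcal{R}(\Pi_{\mathcal{M}}(\theta^*)) + \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)),$$

we conclude that $\mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta^*)) \le \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^*)) + 2 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*))$. Finally, by triangle inequality, we
have $\mathcal{R}(\Delta^*) \le \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^*)) + \mathcal{R}(\Pi_{\bar{\mathcal{M}}^\perp}(\Delta^*))$, and hence

$$\begin{aligned}
\mathcal{R}(\Delta^*) &\le 2 \mathcal{R}(\Pi_{\bar{\mathcal{M}}}(\Delta^*)) + 2 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) \\
&\overset{(i)}{\le} 2 \Psi(\bar{\mathcal{M}}) \|\Pi_{\bar{\mathcal{M}}}(\Delta^*)\| + 2 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) \\
&\overset{(ii)}{\le} 2 \Psi(\bar{\mathcal{M}}) \|\Delta^*\| + 2 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)),
\end{aligned}$$

where inequality (i) follows from Definition [4] of the subspace compatibility $\Psi$, and the bound (ii)
follows from non-expansivity of projection onto a subspace.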

D A general result on Gaussian observation operators 

In this appendix, we state a general result about Gaussian random matrices, and show how it
can be adapted to prove Lemmas [6] and [7]. Let $X \in \mathbb{R}^{n \times d}$ be a Gaussian random matrix with i.i.d.
rows $x_i \sim N(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{d \times d}$ is a covariance matrix. We refer to X as a sample from the
$\Sigma$-Gaussian ensemble. In order to state the result, we use $\Sigma^{1/2}$ to denote the symmetric matrix
square root.
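A draw from the $\Sigma$-Gaussian ensemble can be generated from standard normal vectors via the symmetric square root. The sketch below is one possible construction, with the square root computed from an eigendecomposition; the covariance matrix, sample size, and helper names `sym_sqrt` and `sample_ensemble` are hypothetical choices.

```python
import numpy as np

def sym_sqrt(sigma):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    evals, evecs = np.linalg.eigh(sigma)
    return (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

def sample_ensemble(n, sigma, rng):
    """Return an n x d matrix whose rows are i.i.d. N(0, sigma)."""
    d = sigma.shape[0]
    w = rng.standard_normal((n, d))      # rows are standard normal vectors
    return w @ sym_sqrt(sigma)           # each row becomes Sigma^{1/2} w_i

rng = np.random.default_rng(5)
d = 4
sigma = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))   # hypothetical covariance matrix
root = sym_sqrt(sigma)

X = sample_ensemble(20000, sigma, rng)
emp_cov = X.T @ X / X.shape[0]                     # empirical covariance of the rows
```

Because the square root is symmetric, multiplying the standard normal rows on the right by `root` gives rows with covariance `sigma`, which the empirical covariance reflects up to sampling error.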

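Proposition 1. Given a random matrix X drawn from the $\Sigma$-Gaussian ensemble, there are universal constants $c_i$, i = 0, 1 such that

$$\frac{\|X\theta\|_2^2}{n} \ge \frac{1}{2} \|\Sigma^{1/2} \theta\|_2^2 - c_1 \frac{(\mathbb{E}[\mathcal{R}^*(x_i)])^2}{n}\, \mathcal{R}^2(\theta), \quad \text{and} \qquad (98a)$$

$$\frac{\|X\theta\|_2^2}{n} \le 2 \|\Sigma^{1/2} \theta\|_2^2 + c_1 \frac{(\mathbb{E}[\mathcal{R}^*(x_i)])^2}{n}\, \mathcal{R}^2(\theta) \quad \text{for all } \theta \in \mathbb{R}^d \qquad (98b)$$

with probability greater than $1 - \exp(-c_0 n)$.

We omit the proof of this result. The two special instances proved in Lemma [6] and [7] have been
proved in the papers [35] and respectively. We now show how Proposition [1] can be used to
recover various lemmas required in our proofs.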
Proof of Lemma [6]. We begin by establishing this auxiliary result required in the proof of
CorollaryO When TZ{-) = || • ||i, we have 'R-*{-) = \\ ■ \\oo- Moreover, the random vector Xi ~ A^(0, S) 
can be written as $x_i = \Sigma^{1/2} w$, where $w \sim N(0, I_{d \times d})$ is standard normal. Consequently, using
properties of Gaussian maxima [23] and defining $\zeta(\Sigma) = \max_{j = 1, 2, \ldots, d} \Sigma_{jj}$, we have the bound

$$(\mathbb{E}[\|x_i\|_\infty])^2 \le \zeta(\Sigma)\, (\mathbb{E}[\|w\|_\infty])^2 \le 3\, \zeta(\Sigma) \log d.$$

Substituting into Proposition [1] yields the claims ([62a]) and ([62b]).
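The Gaussian-maxima step can be sanity-checked by Monte Carlo. The sketch below estimates the expected sup-norm of a standard normal vector and compares its square with a bound of the form $3 \log d$ (the constant is as reconstructed here from the garbled source); the dimension, number of trials, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
d, trials = 100, 5000                       # hypothetical dimension and sample count
w = rng.standard_normal((trials, d))

mean_max = np.abs(w).max(axis=1).mean()     # Monte Carlo estimate of E[ ||w||_inf ]
lhs = mean_max ** 2
rhs = 3.0 * np.log(d)
```

The estimate sits comfortably below the bound, in line with the classical fact that the expected maximum of d standard normals in absolute value is at most $\sqrt{2 \log(2d)}$.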

Proof of Lemma [7]. In order to prove this claim, we view each random observation matrix
Xi G M"^^*^ as a d = vector (namely the quantity vec(Xi)), and apply Proposition [T] in this 
vectorized setting. Given the standard Gaussian vector w G M'^^, we let W G W^^^ be the random 
matrix such that vec{W) = w. With this notation, the term 7^*(vec(Xj)) is equivalent to the 
operator norm |||Xj|||op. As shown in Negahban and Wainwright [29], E[|||Xi|||op] < 24(^^t{^) Vd, 
where Cmat was previously defined (f65]) . 



E Auxiliary results for Corollary [5] 

In this section, we provide the proofs of Lemmas [8] and [9] that play a central role in the proof of
Corollary [5]. In order to do so, we require the following result, which is a re-statement of a theorem
due to Negahban and Wainwright [30]:



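Proposition 2. For the matrix completion operator $\mathfrak{X}_n$, there are universal positive constants
$(c_1, c_2)$ such that

$$\Big| \frac{\|\mathfrak{X}_n(\Theta)\|_2^2}{n} - |||\Theta|||_F^2 \Big| \le c_1\, d \|\Theta\|_\infty\, |||\Theta|||_1 \sqrt{\frac{d \log d}{n}} + c_2 \Big( d \|\Theta\|_\infty \sqrt{\frac{d \log d}{n}} \Big)^2 \quad \text{for all } \Theta \in \mathbb{R}^{d \times d} \qquad (99)$$

with probability at least $1 - \exp(-d \log d)$.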
E.l Proof of Lemma [8] 

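Applying Proposition [2] to $\widehat{\Delta}^t$ and using the fact that $d \|\widehat{\Delta}^t\|_\infty \le 2\alpha$ yields

$$\frac{\|\mathfrak{X}_n(\widehat{\Delta}^t)\|_2^2}{n} \ge |||\widehat{\Delta}^t|||_F^2 - c_1 \alpha\, |||\widehat{\Delta}^t|||_1 \sqrt{\frac{d \log d}{n}} - c_2\, \alpha^2\, \frac{d \log d}{n}, \qquad (100)$$

where we recall our convention of allowing the constants to change from line to line. From Lemma [1],

$$|||\widehat{\Delta}^t|||_1 \le 2 \Psi(\bar{\mathcal{M}})\, |||\widehat{\Delta}^t|||_F + 2\, |||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1 + 2\, |||\Delta^*|||_1 + \Psi(\bar{\mathcal{M}})\, |||\Delta^*|||_F.$$

Since $\rho \le |||\Theta^*|||_1$, Lemma [5] implies that $|||\Delta^*|||_1 \le 2 \Psi(\bar{\mathcal{M}})\, |||\Delta^*|||_F + |||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1$, and hence that

$$|||\widehat{\Delta}^t|||_1 \le 2 \Psi(\bar{\mathcal{M}})\, |||\widehat{\Delta}^t|||_F + 4\, |||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1 + 5 \Psi(\bar{\mathcal{M}})\, |||\Delta^*|||_F. \qquad (101)$$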






Combined with the lower bound, we obtain that $\|\mathfrak{X}_n(\widehat{\Delta}^t)\|_2^2 / n$ is lower bounded by



Consequently, for all iterations such that |||A*|||ir > 4:Ci"^{M.-^)\/ we have 



n 2 \ n (. in 

By subtracting off an additional term, the bound is valid for all A* — viz. 

> 1 |||A*|||^ - 2c, ay^{4|||n^x(r)||K + 5*(A^^)|||A*|||^} 

n n 

E.2 Proof of Lemma [9] 

Applying Proposition [2] to $\Gamma^t$ and using the fact that $d \|\Gamma^t\|_\infty \le 2\alpha$ yields

$$\frac{\|\mathfrak{X}_n(\Gamma^t)\|_2^2}{n} \le |||\Gamma^t|||_F^2 + c_1 \alpha\, |||\Gamma^t|||_1 \sqrt{\frac{d \log d}{n}} + c_2\, \alpha^2\, \frac{d \log d}{n}, \qquad (102)$$

where we recall our convention of allowing the constants to change from line to line. By triangle
inequality, we have $|||\Gamma^t|||_1 \le |||\Theta^t - \widehat{\Theta}|||_1 + |||\Theta^{t+1} - \widehat{\Theta}|||_1 = |||\widehat{\Delta}^t|||_1 + |||\widehat{\Delta}^{t+1}|||_1$. Equation ([101]) gives us
bounds on $|||\widehat{\Delta}^t|||_1$ and $|||\widehat{\Delta}^{t+1}|||_1$. Substituting them into the upper bound ([102]) yields the claim.